Abstract

In this work we derive fundamental limits for many linear and non-linear sparse signal processing models including group testing, quantized compressive sensing, multivariate regression and observations with missing features. In general, sparse signal processing problems can be characterized in terms of the following Markovian property. We are given a set of $N$ variables $X_1, X_2, \ldots, X_N$, and there is an unknown subset $S \subset \{1, 2, \ldots, N\}$ of variables that are relevant for predicting outcomes/outputs $Y$. In other words, when $Y$ is conditioned on $\{X_k\}_{k \in S}$ it is conditionally independent of the other variables, $\{X_k\}_{k \notin S}$. Our goal is to identify the set $S$ from samples of the variables $X$ and the associated outcomes $Y$. We characterize this problem as a version of the noisy channel coding theorem. Using asymptotic information theoretic analyses, we describe mutual information formulas that provide sufficient and necessary conditions on the number of samples required to successfully recover the salient variables. This mutual information expression unifies conditions for both linear and non-linear observations. We then compute sample complexity bounds based on the mutual information expressions for different settings including group testing, quantized compressive sensing, multivariate regression and observations with missing features.

1 Introduction

Recent advances in sensing and storage systems have led to the proliferation of high-dimensional data such as images, video or genomic data. Such data cannot be processed efficiently using conventional signal processing methods due to their dimensionality. However, high-dimensional data often exhibit an inherent low-dimensional structure, so they can often be represented "sparsely" in some basis or domain. The discovery of an underlying sparse structure is important in order to compress the acquired data or to develop more robust and efficient processing algorithms.

In this paper, we are concerned with the asymptotic analysis of the sample complexity in problems where we aim to identify a set of salient variables responsible for producing an outcome. In particular, we assume that among a set of $N$ independent and identically distributed (i.i.d.) variables/features/covariates $X = (X_1, \ldots, X_N)$, only $K$ variables (indexed by set $S$) are directly relevant to the outcome $Y$. We formulate this with the assumption that given $X_S = \{X_n\}_{n \in S}$, outcome $Y$ is independent of other variables $\{X_n\}_{n \notin S}$, i.e.,

$$P(Y|X) = P(Y|X_S). \tag{1}$$

We assume we are given $T$ sample pairs $(X, Y)$ and the problem is to identify the set of salient variables, $S$, from these $T$ samples given the knowledge of the observation model $P(Y|X_S)$. Our analysis aims to establish sufficient conditions on $T$ in order to recover the set $S$ with an arbitrarily small error probability in terms of $K$, $N$, the observation model and other model parameters such as the signal-to-noise ratio. We limit our analysis to the i.i.d. setting in this paper for simplicity. It turns out that our analysis methods can be extended to the more general dependent setting at the cost of additional terms in our formulas that compensate for dependencies between variables.

The analysis of the sample complexity is performed by posing this identification problem as an equivalent channel coding problem. The salient set $S$ corresponds to the message transmitted through a channel. The set $S$ is encoded by $X_S^T$ of length $T$, which is the collection of codewords $X_n^T$ for $n \in S$, from a codebook $X^T$. The coded message $X_S^T$ is transmitted through a channel $P(Y|X_S)$ with output $Y^T$. As in channel coding, our aim is to identify which message $S$ was transmitted given channel output $Y^T$ and the codebook $X^T$.

Figure 1: Channel model.

The sufficiency and necessity results we present in this paper are analogous to the channel coding theorem for memoryless channels [10]. Our results are of the form

$$T \, I(X_S; Y) > \log \binom{N}{K}, \tag{2}$$

which can be interpreted as follows: The right side of the inequality is the number of bits required to represent all sets $S$ of size $K$. On the left side, the mutual information term represents the uncertainty reduction on the output $Y$ when given the input $X_S$, in bits per sample. This term essentially quantifies the "capacity" of the observation model $P(Y|X_S)$. Then, the total uncertainty reduction through the $T$ samples should exceed the uncertainty of possible salient sets $S$, in order to reliably recover the salient set.
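As a rough numerical illustration of (2), the sketch below computes the number of bits needed to index all size-$K$ subsets and the smallest sample count satisfying the inequality for a given per-sample mutual information. The function names and the example value of the mutual information are assumptions made for illustration only; they are not taken from the analysis in this paper.

```python
import math


def bits_to_index_support(n_vars: int, k: int) -> float:
    """Right side of (2): log2 of the number of size-k subsets of n_vars variables."""
    return math.log2(math.comb(n_vars, k))


def min_samples(n_vars: int, k: int, mi_bits_per_sample: float) -> int:
    """Smallest integer T with T * I > log2 C(N, K), for a hypothetical value of I."""
    if mi_bits_per_sample <= 0:
        raise ValueError("mutual information must be positive")
    ratio = bits_to_index_support(n_vars, k) / mi_bits_per_sample
    # floor(ratio) + 1 is the smallest integer strictly greater than ratio.
    return math.floor(ratio) + 1


# Hypothetical example: N = 1000 variables, K = 10 salient, I = 0.5 bits per sample.
print(bits_to_index_support(1000, 10), min_samples(1000, 10, 0.5))
```

The computation only restates the counting argument; in the paper the mutual information is determined by the observation model rather than chosen freely.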

Sparse signal processing models analyzed in this paper have wide applicability. Below we list some 
examples of problems which can be formulated in the described framework. 

Linear models arise naturally in array processing where the output Y is obtained as a linear (possibly 
noisy) transformation of some input X. 

Compressive sensing (CS) [12] is a signal processing technique which aims to reconstruct a sparse signal from underdetermined linear systems. In compressed sensing, it is assumed that the output vector $Y$ can be obtained from a $K$-sparse vector $\beta$ through some linear transformation with basis matrix $X$, i.e., in the noisy case with noise $W$, $Y = X\beta + W$. Quantized versions of the problem are also investigated, where the channel model also includes a quantization of the output. The CS model with an example is illustrated in Figure 2. Note that, in contrast to the general CS convention, in our analysis the columns of the sensing matrix correspond to the variables $X$, where the support of the sparse vector corresponds to the set $S$ and the coefficients in the support are absorbed into the channel model.

Models with Missing Features [20]: Our methods also provide sample complexity bounds for sparse signal processing problems with missing features. The problem here is that some of the variables for some of the measurements $Y$ could be missing. Specifically, we observe a $T \times N$ matrix $Z$ instead of $X$, with the relation

$$Z_i^{(t)} = \begin{cases} X_i^{(t)} & \text{w.p. } 1 - \rho \\ \text{missing} & \text{w.p. } \rho \end{cases} \qquad \forall i \in \{1, \ldots, N\},\ t \in \{1, \ldots, T\},$$

i.e., we observe a version of the feature matrix which may have entries missing with probability $\rho$, independently for each entry. Interestingly, our analysis shows that the sample complexity, $T_{\text{miss}}$, for problems with missing features is related to the sample complexity, $T$, of the fully observed case with no missing features by the following simple expression:

$$T_{\text{miss}} = \frac{T}{1 - \rho}.$$
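The missing-feature model can be sketched in a few lines. The code below erases each entry of a feature matrix independently and evaluates the stated scaling of the sample complexity. The marker `MISSING`, the function names and the list-of-lists layout are assumptions chosen for illustration.

```python
import random

MISSING = None  # hypothetical marker for an unobserved entry


def erase_entries(x_rows, miss_prob, rng):
    """Return a copy of x_rows where each entry is independently replaced by MISSING."""
    return [[MISSING if rng.random() < miss_prob else v for v in row]
            for row in x_rows]


def samples_with_missing(t_full, miss_prob):
    """Sample complexity with missing entries: the fully observed T scaled by 1/(1 - p)."""
    if not 0.0 <= miss_prob < 1.0:
        raise ValueError("missing probability must lie in [0, 1)")
    return t_full / (1.0 - miss_prob)


rng = random.Random(0)
x = [[1] * 50 for _ in range(200)]          # T = 200 samples, N = 50 variables
z = erase_entries(x, 0.3, rng)
observed_fraction = sum(v is not MISSING for row in z for v in row) / (200 * 50)
print(observed_fraction, samples_with_missing(100, 0.3))
```

Intuitively, only about a $1 - p$ fraction of each variable's entries survives, which is in line with the stated $1/(1-p)$ inflation of the sample count.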

Figure 2: Compressive Sensing model example and its mapping to the channel model.

Group testing [4] is a form of compressive sensing with Boolean arithmetic. As an example, group testing has been used for medical screening to identify a set of people who have a certain disease in a large population while reducing the total number of tests. The idea is to pool blood samples from subsets of people and to test them simultaneously rather than conducting a separate blood test for each individual. In an ideal setting, the result of a test is positive if and only if the subset contains a positive sample. A significant part of the existing research is focused on combinatorial pool design to guarantee detection using a small number of tests. Several variants of the problem exist, such as noisy group testing with different types of errors. An interesting variant is the graph-constrained group testing problem, where the salient set is the set of defective links in a graph and each test is a random walk on the graph [7]. The model can be represented graphically as in Figure 3, where $X$ is a Boolean testing matrix and $Y$ is the outcome vector. Again, the different columns of the testing matrix correspond to the variables $X$, while the defective set corresponds to set $S$.
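The ideal (noiseless) group testing model described above can be simulated directly. The sketch below assumes an i.i.d. Bernoulli pooling design, which is one common choice and not the only one; the function name and parameters are hypothetical.

```python
import random


def simulate_group_tests(n_items, defective, n_tests, p_include, rng):
    """Draw a random Boolean testing matrix and the noiseless test outcomes.

    Each item joins each test independently with probability p_include
    (an assumed design). A test is positive iff it contains a defective item.
    """
    x = [[1 if rng.random() < p_include else 0 for _ in range(n_items)]
         for _ in range(n_tests)]
    y = [int(any(row[j] for j in defective)) for row in x]
    return x, y


rng = random.Random(1)
x, y = simulate_group_tests(n_items=20, defective={2, 7}, n_tests=30,
                            p_include=0.3, rng=rng)
print(sum(y), "positive tests out of", len(y))
```

Each column of `x` plays the role of one variable, and `defective` plays the role of the salient set.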

Sparse Channel Estimation [9] is used for the estimation of multi-path channels characterized by sparse
impulse responses. The output of the channel depends on the input time instances which correspond to the 
non-zero coefficients of the impulse response. In an equivalent channel model, the indices of the non-zero 
coefficients in the impulse response correspond to the encoded set S and the coefficients themselves are 
absorbed into the channel model. 

Among the examples stated above, compressive sensing in particular is a fairly well-studied problem. The conditions for recovery in linear CS with measurement noise have been described and studied extensively in the literature [12], [27], [HI, [26], [D, [21, [29] through the analysis of properties such as the restricted isometry property [H!, as well as using information-theoretic approaches. It has been established that $T = \Omega(K \log(N/K))$ is a sufficient condition for support recovery.

Another variant of the compressive sensing problem, 1-bit CS |S], is interesting as the extreme case of CS models with quantized measurements, which are of practical importance in many real-world applications. The conditions on the number of measurements have been studied for both noiseless [18] and noisy [17] models and $T = \Omega(K \log N)$ has been established as a sufficient condition for Gaussian sensing matrices.

The identification problem was formulated in a channel coding framework in [l] and in the Russian literature [211 [HI 1121 [231 US]. Sufficient and necessary conditions on the number of tests in the group testing problem with i.i.d. test assignments were derived. One main difference between the Russian literature and [4] is that, in the earlier work, the number of defective items, $K$, is held fixed while the number of items, $N$, approaches infinity. Consequently, the earlier work suggests that the number of tests must scale poly-logarithmically in $N$ regardless of $K$ for error probability to approach zero. In contrast, [4] considers the fully high-dimensional setting wherein both the number of defectives as well as the number of items can approach infinity. The sufficient condition in [4] was derived based on the analysis of a Maximum Likelihood (ML) decoder, while the necessary condition was derived using Fano's inequality [10]. This analysis was extended to general sparse signal processing models in [3].

Figure 3: Group testing example and its mapping to the channel model. The codeword $X$ is the measurement matrix, which determines whether a sample is included in a test. The result of the first test is a false positive, while the last test is a false negative.

Figure 4: The channel model characterization of the sparse channel estimation problem.

In this paper, we are concerned specifically with the analysis of the problem with i.i.d. variables $X$, which allows analysis of a large number of important problems, such as the classical group testing or compressive sensing models. This paper presents a more thorough analysis than [3], including analysis of problems with latent variable observation models, formally extending the analysis to continuous models, presenting results for the $K = o(N)$ scaling regime and analysis of linear and 1-bit compressive sensing problems as applications.

In Section 2 we provide a general description of the problem. In Section 3 we state necessary and sufficient conditions on the number of samples required for recovery. We provide remarks for continuous models in Section 3.3 and we consider the scaling regime of $K$ in Section 3.4. Applications are considered in Section 4, including bounds for group testing and compressive sensing models. We summarize our results in Section 5.



2 Problem Setup 



We introduce our notational convention that will be used throughout the paper. We use upper case letters to denote random variables, vectors and matrices, while we use lower case letters to denote scalars, vectors and matrices, to distinguish between random quantities and their realizations. Subscripts are used for column indexing and superscripts with parentheses are used for row indexing in vectors and matrices. Subscripting with a set $S$ implies the selection of columns with indices in $S$. Table 1 provides a reference and further details on the notation used. $\log$ is used to denote the logarithm to the base 2 and the natural logarithm is denoted by $\ln$.

Let $X = (X_1, X_2, \ldots, X_N) \in \mathcal{X}^N$ denote a set of i.i.d. random variables with a joint probability distribution $Q(X)$. To simplify the expressions, we suppress the subscript indexing with random variables since the distribution function is determined solely by the number of variables indexed.

We index the different sets of size $K$ as $S_\omega$ with index $\omega$, so that $S_\omega$ is a set of $K$ indices corresponding to the $\omega$-th set of variables. Since there are $N$ variables in total, there are $\binom{N}{K}$ such sets, hence

$$\omega \in \mathcal{I} = \left\{1, 2, \ldots, \binom{N}{K}\right\}.$$



Table 1: Reference for notation used

Quantity                               Random quantities              Corresponding realizations
Variables                              $X_1, \ldots, X_N$             $x_1, \ldots, x_N$
$1 \times N$ random vector             $X = (X_1, \ldots, X_N)$       $x = (x_1, \ldots, x_N)$
$1 \times |S|$ random vector           $X_S$                          $x_S$
$T \times N$ random matrix             $X^T$                          $x^T$
$t$-th row of $X^T$                    $X^{(t)}$                      $x^{(t)}$
$n$-th column of $X^T$                 $X_n^T$                        $x_n^T$
$n$-th element of $t$-th row           $X_n^{(t)}$                    $x_n^{(t)}$
$T \times |S|$ sub-matrix              $X_S^T$                        $x_S^T$
Outcome                                $Y$                            $y$
$T \times 1$ vector of outcomes        $Y^T$                          $y^T$
$t$-th element of $Y^T$                $Y^{(t)}$                      $y^{(t)}$



We let $Y \in \mathcal{Y}$ denote an observation or outcome, which depends only on a small subset of variables $S \subset \{1, \ldots, N\}$ of known cardinality $|S| = K$ where $K \ll N$. In particular, $Y$ is conditionally independent of the other variables given the subset of variables indexed by the index set $S$, as in (1), i.e.,

$$P(Y|X) = P(Y|X_S)$$

where $X_S = \{X_k\}_{k \in S}$ is the subset of variables indexed by the set $S$.

We consider a more general observation model compared to [1] and [5], where the observation model is not completely deterministic and known, but depends on a latent variable $\beta_S$. We assume $\beta_S$ is independent of the variables $X$ and has a prior distribution $P(\beta_S)$. The outcomes depend on both $X_S$ and $\beta_S$ and are generated according to the model $P(Y|X_S, \beta_S)$. As an example, this latent variable corresponds to the non-zero coefficients of the $K$-sparse vector $\beta$ in the CS framework in Section 4.1.1 or the impulse response coefficients in the sparse channel estimation framework. Note that (1) still holds in this model.

We use lower-case $p(Y|X_S)$ notation for the conditional outcome distribution given the true subset of variables averaged over the latent variable $\beta_S$. In some cases when we would like to distinguish between the outcome distribution conditioned on different sets of variables we use $P_\omega(\cdot|\cdot)$ notation, to emphasize that the conditional distribution is conditioned on the given variables, assuming the true set $S^*$ is $S_\omega$.

We observe the realizations $(x^T, y^T)$ of $T$ variable-outcome pairs $(X^T, Y^T)$ with each sample realization $(x^{(t)}, y^{(t)})$ of $(X^{(t)}, Y^{(t)})$, $t = 1, 2, \ldots, T$. The variables $X^{(t)}$ are distributed i.i.d. across $t = 1, \ldots, T$. However, the outcomes $Y^{(t)}$ are independent for different $t$ only when conditioned on $\beta_S$. Our goal is to identify the set $S$ from the data samples and the associated outcomes $(x^T, y^T)$, with an arbitrarily small average error probability.

We let $\hat{S}(X^T, Y^T)$ denote the estimate of the set $S$, which is random due to the randomness in $X$ and $Y$. Conditioned on a particular set $S$, we define the conditional error probability $P(E|S)$ as an average error probability over all possible realizations of data samples $X^T$ and outcomes $Y^T$, i.e.,

$$P(E|S) = \Pr[\hat{S}(X^T, Y^T) \neq S \mid S] \tag{3}$$

where the randomness is over the variables $X^T$ and the outcome $Y^T$. Also let $\lambda_{x^T}(S)$ denote the average error probability conditioned on a particular $S$ and a given realization of the $T \times N$ data samples matrix $x^T$. Hence,

$$\lambda_{x^T}(S) = \Pr[\hat{S}(X^T, Y^T) \neq S \mid S, X^T = x^T] \tag{4}$$

where the randomness is over the outcome $Y^T$. Given $\lambda$ and $Q$ we have

$$P(E|S) = \sum_{x^T} \lambda_{x^T}(S)\, Q(x^T).$$

We further let $P(E)$ denote the average probability of error, averaged over all sets $S$ of size $K$, all possible data samples $X^T$ and outcomes $Y^T$, i.e.,

$$P(E) = \Pr[\hat{S}(X^T, Y^T) \neq S].$$

We assume that all sets are equally likely. So by symmetry, it is easy to see that the average error probability does not depend on the set $S$ and we can assume without loss of generality that $\omega = 1$, i.e., $S_1$ is the true set.

Lastly, for any two sets $S_i$ and $S_j$, we define $S_{i,j}$, $S_{i^c,j}$, and $S_{i,j^c}$ as the overlap set, the set of indices in $S_j$ but not in $S_i$, and the set of indices in $S_i$ but not in $S_j$, respectively. Namely,

$$S_{i,j} = S_i \cap S_j \qquad \text{(overlap)}$$
$$S_{i^c,j} = S_i^c \cap S_j \qquad \text{(in $j$ but not in $i$)}$$
$$S_{i,j^c} = S_i \cap S_j^c \qquad \text{(in $i$ but not in $j$)}$$
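For concreteness, the three index sets just defined map directly onto elementary set operations. The helper below is a minimal sketch with a hypothetical name; it also makes visible that two sets of equal size always have equally many indices outside their overlap, which is the count used later to classify decoding errors.

```python
def overlap_sets(s_i, s_j):
    """Return (overlap, in j but not in i, in i but not in j) for two index sets."""
    s_i, s_j = set(s_i), set(s_j)
    return s_i & s_j, s_j - s_i, s_i - s_j


overlap, only_j, only_i = overlap_sets({1, 2, 3}, {2, 3, 4})
# Equal-size sets differ in the same number of indices on both sides.
print(overlap, only_j, only_i, len(only_j) == len(only_i))
```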



3 Conditions for Recovery 

In this section we state and prove sufficient and necessary conditions for the recovery of the salient set $S$ with an arbitrarily small average error probability, for discrete variables $X$ and for the sparsity regime where the support size $K$ is fixed with respect to the dimension $N$. The extensions to continuous variables and the high-dimensional regime with $K = o(N)$ scaling are considered in the subsequent sections.

Central to our analysis are the following three assumptions, which we utilize in order to analyze the 
probability of error in recovering the salient set and to obtain sufficient and necessary conditions on sample 
complexity. 

• Equi-probable support: Any set $S_\omega \subset \{1, \ldots, N\}$ with $K$ elements is equally likely a priori to be the salient set. This assumption implies that we have no prior knowledge of the salient set $S$ among the $\binom{N}{K}$ sets in $\mathcal{I}$.

• Conditional independence: The observation/outcome $Y$ is conditionally independent of other variables given $X_S$, the variables with indices in $S$, i.e.,

$$P(Y|X) = P(Y|X_S).$$

This assumption follows directly from the definition of the sparse recovery problem.

• IID variables: The variables $X_1, \ldots, X_N$ are independent and identically distributed. While the independence assumption is not valid for all sparse recovery problems, many problems of interest can be analyzed within the i.i.d. framework, as in Section 4.

In many sparse recovery problems, we are concerned with the recovery of an underlying sparse vector $\beta$, which has a sparsity support $S$ and coefficients $\beta_S$ on the indices in the support. The observation model inherently depends on the values of these coefficients in such problems.

Defining the support coefficients as latent variables in our observation model, as stated in Section 2, such that

$$P(Y|X) = P(Y|X_S) = \int P(Y|X_S, \beta_S)\, P(\beta_S)\, d\beta_S,$$

we are able to analyze such problems while taking their observation structure into consideration. For instance, a simple example is the following linear observation model, where

$$Y = X^\top \beta + W = X_S^\top \beta_S + W,$$

with noise $W$, which exhibits such structure; along with extensions to non-linear models, where

$$Y = f(X_S^\top \beta_S) + W,$$

for a function $f : \mathbb{R} \to \mathbb{R}$.
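The linear and non-linear observation models above can be sampled with a short helper. The sketch below is illustrative only: it assumes Gaussian noise $W$ (the text leaves the noise distribution unspecified), and the function and parameter names are hypothetical.

```python
import random


def sample_observation(x_row, support, beta_s, noise_std, rng, f=None):
    """Draw one outcome Y = f(<x_S, beta_S>) + W.

    x_row     : one sample of all N variables
    support   : indices of the salient set S
    beta_s    : latent coefficients, aligned with support
    noise_std : standard deviation of the (assumed Gaussian) noise W
    f         : optional non-linearity; identity gives the linear model
    """
    inner = sum(x_row[n] * b for n, b in zip(support, beta_s))
    signal = inner if f is None else f(inner)
    return signal + rng.gauss(0.0, noise_std)


rng = random.Random(0)
x_row = [1.0, 2.0, 3.0, 4.0]
# Only the entries indexed by the support influence the outcome.
y_linear = sample_observation(x_row, [1, 3], [0.5, -1.0], 0.1, rng)
y_sign = sample_observation(x_row, [1, 3], [0.5, -1.0], 0.1, rng,
                            f=lambda v: 1.0 if v >= 0 else -1.0)
print(y_linear, y_sign)
```

Changing any entry of `x_row` outside the support leaves the distribution of the outcome unchanged, which is the conditional independence property (1).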

3.1 Sufficiency 

To derive the sufficiency bound for the required number of samples, we analyze the error probability of a Maximum Likelihood (ML) decoder [16]. The decoder goes through all $\binom{N}{K}$ possible sets of size $K$, and chooses the set $S_{\omega^*}$ for which outcome $Y^T$ is most likely, i.e.,

$$p(Y^T | X^T_{S_{\omega^*}}) > p(Y^T | X^T_{S_\omega}), \quad \forall \omega \neq \omega^*. \tag{5}$$

An error occurs if any set other than the true set $S_1$ is more likely. This ML decoder is a minimum probability of error decoder assuming a uniform prior on the candidate sets of variables. Note that the ML decoder requires the knowledge of the observation model $P(Y|X_S, \beta_S)$ and the prior $P(\beta_S)$. Next, we derive an upper bound on the average error probability $P(E)$ of the ML decoder, where the average is taken over all sets, data realizations and observations.
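The exhaustive search performed by the ML decoder can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder analyzed in the proofs: it assumes there is no latent variable (a fixed, known observation model), so the likelihood factorizes over samples, and it uses a hypothetical noisy-OR group testing likelihood as the example model. All function names are invented for this sketch.

```python
import itertools
import math
import random


def ml_decode(x, y, k, log_lik):
    """Exhaustive ML search over all size-k index sets.

    log_lik(y_t, x_s) returns log p(y_t | x_s) for one sample. Summing over
    samples assumes outcomes are independent across t, i.e. no latent variable.
    """
    n = len(x[0])
    best, best_score = None, -math.inf
    for cand in itertools.combinations(range(n), k):
        score = sum(log_lik(y_t, [row[j] for j in cand])
                    for row, y_t in zip(x, y))
        if score > best_score:
            best, best_score = cand, score
    return best


def noisy_or_log_lik(flip):
    """Assumed model: Y is the OR of the salient entries, flipped w.p. flip."""
    def log_lik(y_t, x_s):
        clean = int(any(x_s))
        return math.log(1.0 - flip if y_t == clean else flip)
    return log_lik


rng = random.Random(0)
x = [[1 if rng.random() < 0.3 else 0 for _ in range(8)] for _ in range(60)]
y = [int(row[1] or row[5]) for row in x]          # true set {1, 5}, no flips
print(ml_decode(x, y, 2, noisy_or_log_lik(0.1)))
```

The search visits every candidate set, so its cost grows as $\binom{N}{K}$; it is useful as a reference point for the analysis rather than as a practical algorithm.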

Define the error event $E_i$ as the event of mistaking the true set for a set which differs from the true set $S_1$ in exactly $i$ variables. The probability of such an event is denoted $P(E_i)$. The event $E_i$ implies that there exists some set which differs from the true set in $i$ variables and is more likely to the decoder. Hence,

$$P(E_i) \le \Pr\left[\exists \omega \neq 1 : p(Y^T|X^T_{S_\omega}) \ge p(Y^T|X^T_{S_1}), \text{ where } |S_{1^c,\omega}| = |S_{1,\omega^c}| = i, \text{ and } |S_1| = |S_\omega| = K\right]. \tag{6}$$

The probability $P(E_i)$ can be written as a summation over all inputs $X^T_{S_1}$ and all outcomes $Y^T$:

$$P(E_i) = \sum_{X^T_{S_1}} \sum_{Y^T} Q(X^T_{S_1})\, p(Y^T|X^T_{S_1}) \Pr[E_i \mid \omega_0 = 1, X^T_{S_1}, Y^T] \tag{7}$$

where $\Pr[E_i \mid \omega_0 = 1, X^T_{S_1}, Y^T]$ is the probability of decoding error in exactly $i$ variables, conditioned on the true index $\omega_0 = 1$, the realization $X^T_{S_1}$ for the set $S_1$, and on the sequence $Y^T$. This can be viewed as the error probability for a communication system with a transmitted message $\omega_0 = 1$, encoded message $X^T_{S_1}$ and received sequence $Y^T$. Using the union bound, the conditional error probability averaged over data realizations is upper bounded by

$$P(E|S_1) \le \sum_{i=1}^{K} P(E_i) = \sum_{i=1}^{K} \sum_{X^T_{S_1}} \sum_{Y^T} Q(X^T_{S_1})\, p(Y^T|X^T_{S_1}) \Pr[E_i \mid \omega_0 = 1, X^T_{S_1}, Y^T]. \tag{8}$$

Next we state our main result. The following theorem provides a sufficient condition on the number of 
samples T for an arbitrarily small average error probability. 

Theorem 3.1. (Sufficiency). Define $\Xi_S^{\{i\}}$ as the set of tuples $(S^1, S^2)$ partitioning the true set $S$ into disjoint sets $S^1$ and $S^2$ with cardinalities $i$ and $K - i$, respectively, i.e.,

$$\Xi_S^{\{i\}} = \left\{(S^1, S^2) : S^1 \cap S^2 = \emptyset,\ S^1 \cup S^2 = S,\ |S^1| = i,\ |S^2| = K - i\right\}. \tag{9}$$

If the number of samples $T$ is such that

$$T > (1 + \epsilon) \cdot \max_{i = 1, \ldots, K,\ (S^1, S^2) \in \Xi_S^{\{i\}}} \frac{\log \binom{N-K}{i}\binom{K}{i}}{I(X_{S^1}; X_{S^2}, Y \mid \beta_S)}, \tag{10}$$

then, asymptotically the average error probability approaches zero, i.e.,

$$\lim_{K \to \infty} \lim_{N \to \infty} P(E) = 0,$$

where $\epsilon > 0$ is an arbitrary constant independent of $N$ and $K$ and $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$ is the mutual information [10] between $X_{S^1}$ and $(X_{S^2}, Y)$ conditioned on $\beta_S$.
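To make the condition in Theorem 3.1 concrete, the sketch below evaluates its right-hand side for one specific model that is assumed here purely for illustration: noiseless group testing, $Y = \mathrm{OR}(X_S)$, with i.i.d. Bernoulli($p$) variables and no latent variable. In that model the mutual information equals $H(Y|X_{S^2})$, which is $(1-p)^{K-i}\, h\big((1-p)^i\big)$ with $h$ the binary entropy, and it is the same for every partition with $|S^1| = i$. The function names are hypothetical, and finite-$N$ values only indicate the scale of an asymptotic statement.

```python
import math


def h2(q):
    """Binary entropy in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def mi_noiseless_or(k, i, p):
    """I(X_{S^1}; X_{S^2}, Y) for Y = OR(X_S), i.i.d. Bernoulli(p), |S^1| = i.

    Y is random given X_{S^2} only if all K - i revealed entries are zero,
    in which case Y = 0 with probability (1 - p)^i.
    """
    return (1.0 - p) ** (k - i) * h2((1.0 - p) ** i)


def sufficient_samples(n, k, p, eps=0.0):
    """(1 + eps) * max over i of log2[C(N-K, i) C(K, i)] / I_i."""
    return (1.0 + eps) * max(
        math.log2(math.comb(n - k, i) * math.comb(k, i)) / mi_noiseless_or(k, i, p)
        for i in range(1, k + 1)
    )


# Hypothetical example: N = 1000 items, K = 5 defectives, p = 1/K.
print(sufficient_samples(1000, 5, 1.0 / 5))
```

The maximization over $i$ matters because the number of competing sets grows with $i$ while the information available to separate them depends on $i$ in a model-specific way.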



Theorem 3.1 follows from a tight bound, based on the characterization of error exponents as in [16], on the error probability $P(E_i)$. We will show that the error exponent, $E_o(\rho)$, is described by:

$$E_o(\rho) = -\frac{1}{T} \log \sum_{Y^T} \sum_{X^T_{S^2}} \left[ \sum_{X^T_{S^1}} Q(X^T_{S^1})\, p(Y^T, X^T_{S^2} \mid X^T_{S^1})^{\frac{1}{1+\rho}} \right]^{1+\rho}, \quad 0 \le \rho \le 1 \tag{11}$$

where $(S^1, S^2) \in \Xi_S^{\{i\}}$, defined in (9), denotes any disjoint partition of the set of variables $S_1$ with cardinalities $i$ and $K - i$, respectively. $X^T_{S^1}$ and $X^T_{S^2}$ are the corresponding disjoint partitions of the $T \times K$ input $X^T_{S_1}$ of sizes $T \times i$ and $T \times (K - i)$, respectively. We state the following lemma, which upper bounds the probability of decoding error in $i$ variables:



Lemma 3.1. The probability of the error event $E_i$ defined in (7) that a set which differs from the set $S_1$ in exactly $i$ variables is selected by the ML decoder (averaged over all data realizations and outcomes) is bounded from above by

$$P(E_i) \le 2^{-T\left(E_o(\rho) - \rho\, \frac{\log \binom{N-K}{i}\binom{K}{i}}{T}\right)}. \tag{12}$$

The proof of Lemma 3.1 is provided in the Appendix.

Proof of Theorem 3.1.



We need to derive a sufficient condition for the error exponent of the error probability $P(E_i)$ in (12) to be positive and to drive the error probability to zero as $N \to \infty$. Specifically, we require

$$T f(\rho) = T E_o(\rho) - \rho \log \binom{N-K}{i}\binom{K}{i} \to \infty, \tag{13}$$

where

$$f(\rho) = E_o(\rho) - \rho\, \frac{\log \binom{N-K}{i}\binom{K}{i}}{T}$$

and where $E_o(\rho)$ is defined in (11).

To establish (10) we follow the argument in [16]. Note that $f(0) = 0$. Since the function $f(\rho)$ is differentiable and has a power series expansion, for a sufficiently small $\delta$, we get by Taylor series expansion in the neighborhood of $\rho \in [0, \delta]$ that

$$f(\rho) = f(0) + \rho \left.\frac{df}{d\rho}\right|_{\rho = 0} + O(\rho^2).$$

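But we can show that

$$\left.\frac{\partial E_o}{\partial \rho}\right|_{\rho = 0} = \frac{1}{T} \sum_{Y^T} \sum_{X^T_{S^2}} \sum_{X^T_{S^1}} Q(X^T_{S^1})\, p(Y^T, X^T_{S^2} \mid X^T_{S^1}) \log \frac{p(Y^T, X^T_{S^2} \mid X^T_{S^1})}{\sum_{\tilde{X}^T_{S^1}} Q(\tilde{X}^T_{S^1})\, p(Y^T, X^T_{S^2} \mid \tilde{X}^T_{S^1})},$$

which simplifies to

$$\left.\frac{\partial E_o}{\partial \rho}\right|_{\rho = 0} = \frac{I(X^T_{S^1}; X^T_{S^2}, Y^T)}{T}. \tag{14}$$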
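Note that we can further decompose $I(X^T_{S^1}; X^T_{S^2}, Y^T)$ using the following chain of equalities:

$$I(X^T_{S^1}; X^T_{S^2}, Y^T) + I(\beta_S; X^T_{S^1} \mid X^T_{S^2}, Y^T) = I(X^T_{S^1}; X^T_{S^2}, Y^T, \beta_S) = I(X^T_{S^1}; \beta_S) + I(X^T_{S^1}; X^T_{S^2}, Y^T \mid \beta_S) = T\, I(X_{S^1}; X_{S^2}, Y \mid \beta_S),$$

where the last equality is due to $X$ and $\beta_S$ being independent and $(X^{(t)}, Y^{(t)})$ pairs being independent over $t$ given $\beta_S$. Therefore we have

$$\left.\frac{\partial E_o}{\partial \rho}\right|_{\rho = 0} = \frac{I(X^T_{S^1}; X^T_{S^2}, Y^T)}{T} = I(X_{S^1}; X_{S^2}, Y \mid \beta_S) - \frac{I(\beta_S; X^T_{S^1} \mid X^T_{S^2}, Y^T)}{T}. \tag{15}$$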



Now assume that $T$ satisfies

$$T > (1 + \epsilon)\, \frac{\log \binom{N-K}{i}\binom{K}{i}}{I(X_{S^1}; X_{S^2}, Y \mid \beta_S)}, \tag{16}$$

which is implied by condition (10). We note that from the Lagrange form of the Taylor series expansion (an application of the mean value theorem) we can write $E_o(\rho)$ in terms of its first derivative evaluated at zero and a remainder term, i.e.,

$$E_o(\rho) = E_o(0) + \rho E_o'(0) + \frac{\rho^2}{2} E_o''(\psi)$$



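for some $\psi \in [0, \rho]$. Hence, for the choice of $T$ in (16) and using (15) we have

$$T f(\rho) > T \left[ \rho \left( \frac{\epsilon}{1+\epsilon}\, I(X_{S^1}; X_{S^2}, Y \mid \beta_S) - \frac{I(\beta_S; X^T_{S^1} \mid X^T_{S^2}, Y^T)}{T} \right) - \rho^2\, C\, I(X_{S^1}; X_{S^2}, Y \mid \beta_S) \right] \tag{17}$$

where $C = \frac{-E_o''(\psi)}{2 I(X_{S^1}; X_{S^2}, Y \mid \beta_S)}$, which might depend on $K$.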



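A preliminary analysis of (16) reveals that $T = \Omega(\log N)$, since $\log \binom{N-K}{i}\binom{K}{i} = \Theta(i \log N)$ and $I(X_{S^1}; X_{S^2}, Y \mid \beta_S) = I(X_{S^1}; Y \mid X_{S^2}, \beta_S) \le H(Y) = O(1)$. Also, $I(\beta_S; X^T_{S^1} \mid X^T_{S^2}, Y^T) \le H(\beta_S)$, which is constant with respect to $N$ since the observation model is only dependent on $K$ variables, due to the sparsity assumption of the observation model $P(Y|X)$. So we see that

$$\frac{I(\beta_S; X^T_{S^1} \mid X^T_{S^2}, Y^T)}{T} = O\left(\frac{1}{\log N}\right),$$

which is always dominated by $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$. Therefore (17) is asymptotically equivalent to

$$T f(\rho) > T \left[ \rho\, \frac{\epsilon}{1+\epsilon}\, I(X_{S^1}; X_{S^2}, Y \mid \beta_S) - \rho^2\, C\, I(X_{S^1}; X_{S^2}, Y \mid \beta_S) \right].$$

Finally, if we choose $\rho < \epsilon'/C$, where $\epsilon' = \frac{\epsilon}{1+\epsilon}$, then $f(\rho) \ge \delta$ for some $\delta > 0$ which does not depend on $N$ or $T$. It follows that $T f(\rho) \to \infty$ as $N \to \infty$.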

We have just shown that for fixed $K$, $T > (1 + \epsilon) \cdot \frac{\log \binom{N-K}{i}\binom{K}{i}}{I(X_{S^1}; X_{S^2}, Y \mid \beta_S)}$ is sufficient to ensure an arbitrarily small $P(E_i)$. Since the average error probability $P(E) \le \sum_{i=1}^{K} P(E_i)$, it follows that for any fixed $K$, $\lim_{N \to \infty} \sum_{i=1}^{K} P(E_i) = 0$. Consequently, since this is true for any $K$, $\lim_{K \to \infty} \lim_{N \to \infty} \sum_{i=1}^{K} P(E_i) = 0$.

Theorem 3.1 now follows. $\square$

It is important to highlight the main difference between the analysis of the error probability for the problem considered herein and the channel coding problem. In contrast to channel coding, the codewords of a candidate set and the true set are not independent since the two sets could be overlapping. To overcome this difficulty, we separate the error events $E_i$, $i = 1, \ldots, K$, of misclassifying the true set in $i$ items. Then, for every $i$ we average over realizations of the ensemble of codewords for every candidate set while holding fixed the partition common to these sets and the true set of variables.

3.2 Necessity 

In this section we derive lower bounds on the required number of measurements using Fano's inequality [10]. We state the following theorem:

Theorem 3.2. For $N$ variables and a set $S_\omega$ of $K$ salient variables, a lower bound on the total number of measurements required to recover the set is given by

$$T \ge \max_{i = 1, \ldots, K,\ (S^1, S^2) \in \Xi_{S_\omega}^{\{i\}}} \frac{\log \binom{N-K+i}{i}}{I(X_{S^1}; X_{S^2}, Y \mid \beta_S)}, \tag{18}$$

where the set $\Xi_{S_\omega}^{\{i\}}$ is the set of tuples $(S^1, S^2)$ partitioning the set $S_\omega$ into disjoint sets $S^1$ and $S^2$ with cardinalities $i$ and $K - i$, respectively, as defined in (9).
Proof. The vector of outcomes $Y^T$ is probabilistically related to the index $\omega \in \mathcal{I} = \{1, 2, \ldots, \binom{N}{K}\}$. Suppose $K - i$ elements of the salient set are revealed to us, denoted by $S^2$. From $X^T$ and $Y^T$ we estimate the set index $\omega$. Let the estimate be $\hat{\omega} = g(X^T, Y^T)$. Define the probability of error

$$P_e = P(E) = \Pr[\hat{\omega} \neq \omega].$$

$E$ is a binary random variable that takes the value 1 in case of an error, i.e., if $\hat{\omega} \neq \omega$, and 0 otherwise. Then, using the chain rule of entropies [10], we have

$$H(E, \omega \mid Y^T, X^T, S^2) = H(\omega \mid Y^T, X^T, S^2) + H(E \mid \omega, Y^T, X^T, S^2) = H(E \mid Y^T, X^T, S^2) + H(\omega \mid E, Y^T, X^T, S^2). \tag{19}$$

The random variable $E$ is fully determined given $X^T$, $Y^T$, $\omega$ and $S^2$. It follows that $H(E \mid \omega, Y^T, X^T, S^2) = 0$. Since $E$ is a binary random variable, $H(E \mid Y^T, X^T, S^2) \le 1$. Consequently, we can bound $H(\omega \mid E, Y^T, X^T, S^2)$ as follows,

$$H(\omega \mid E, Y^T, X^T, S^2) = P(E = 0)\, H(\omega \mid E = 0, Y^T, X^T, S^2) + P(E = 1)\, H(\omega \mid E = 1, Y^T, X^T, S^2)$$
$$\le (1 - P_e) \cdot 0 + P_e \log\left(\binom{N-K+i}{i} - 1\right)$$
$$\le P_e \log \binom{N-K+i}{i}. \tag{20}$$

The first inequality follows from the fact that, after revealing $K - i$ entries and given that $E = 1$, the conditional entropy can be upper bounded by the logarithm of the number of outcomes. From (19), we obtain the genie-aided Fano's inequality

$$H(\omega \mid Y^T, X^T, S^2) \le 1 + P_e \log \binom{N-K+i}{i}. \tag{21}$$



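Note that for the left hand term, we have

$$H(\omega \mid Y^T, X^T, S^2) = H(\omega \mid S^2) - I(\omega; Y^T, X^T \mid S^2)$$
$$= H(\omega \mid S^2) - I(\omega; X^T \mid S^2) - I(\omega; Y^T \mid X^T, S^2)$$
$$\overset{(a)}{=} H(\omega \mid S^2) - I(\omega; Y^T \mid X^T, S^2)$$
$$\overset{(b)}{=} H(\omega \mid S^2) - \left(H(Y^T \mid X^T, S^2) - H(Y^T \mid X^T, \omega)\right)$$
$$\overset{(c)}{\ge} H(\omega \mid S^2) - \left(H(Y^T \mid X^T_{S^2}) - H(Y^T \mid X^T_{S_\omega})\right)$$
$$\overset{(d)}{=} H(\omega \mid S^2) - I(X^T_{S^1}; Y^T \mid X^T_{S^2})$$

where (a) follows from the fact that $X^T$ is independent of $S^2$ and $\omega$; (b) follows from the fact that conditioning with respect to $\omega$ includes conditioning with respect to $S^2$; (c) follows from the fact that $Y^T$ depends on $S^2$ only through $X^T_{S^2}$ and similarly for the second term $Y^T$ depends on $\omega$ only through $X^T_{S_\omega}$, and finally we used the fact that conditioning reduces entropy in the first term to remove conditioning on $X^T_{S^1}$; the argument for (d) follows by definition.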



From (21), it then follows that

$$H(\omega \mid S^2) - I(X^T_{S^1}; Y^T \mid X^T_{S^2}) \le 1 + P_e \log \binom{N-K+i}{i},$$

and since the set $S^2$ of $K - i$ variables is revealed, $\omega$ is uniformly distributed over the set of indices that correspond to sets of size $K$ containing $S^2$. It follows that

$$\log \binom{N-K+i}{i} - I(X^T_{S^1}; Y^T \mid X^T_{S^2}) \le 1 + P_e \log \binom{N-K+i}{i}.$$

Rewriting the above inequality, we have

$$P_e \ge 1 - \frac{I(X^T_{S^1}; Y^T \mid X^T_{S^2}) + 1}{\log \binom{N-K+i}{i}}. \tag{22}$$

Thus, for the probability of error to approach zero asymptotically, it is necessary that

$$\log \binom{N-K+i}{i} \le I(X^T_{S^1}; Y^T \mid X^T_{S^2}). \tag{23}$$



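Due to the independence of variables in $X$, we have

$$I(X^T_{S^1}; Y^T \mid X^T_{S^2}) = I(X^T_{S^1}; X^T_{S^2}, Y^T) - I(X^T_{S^1}; X^T_{S^2}) = I(X^T_{S^1}; X^T_{S^2}, Y^T), \tag{24}$$

then, using (15), we can see that

$$T \ge \max_{(S^1, S^2) \in \Xi_{S_\omega}^{\{i\}}} \frac{\log \binom{N-K+i}{i}}{I(X_{S^1}; X_{S^2}, Y \mid \beta_S) - \frac{I(\beta_S; X^T_{S^1} \mid X^T_{S^2}, Y^T)}{T}}$$

is a necessary condition for the number of samples $T$.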

Similar to the previous proof, we note that $I(X_{S^1}; X_{S^2}, Y \mid \beta_S) = I(X_{S^1}; Y \mid X_{S^2}, \beta_S) \le H(Y) = O(1)$ and $\log \binom{N-K+i}{i} = \Theta(i \log N)$. Therefore $T = \Omega(\log N)$ and $\frac{I(\beta_S; X^T_{S^1} \mid X^T_{S^2}, Y^T)}{T}$ is dominated by $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$, so that (18) satisfies the above inequality asymptotically and is a lower bound on $T$. $\square$



Remark 3.1. The mutual information expressions in the denominators of (10) and (18) are identical, therefore the lower bound given in Theorem 3.2 is order-wise tight as it matches the upper bound in Theorem 3.1.

Remark 3.2. Intuitively, the bounds in (10) and (18) can be explained as follows: For each $i$, the numerator is the number of bits required to represent all sets $S_\omega$ that differ from $S$ in $i$ elements. The denominator represents the information given by the subset $S^2$ of $K - i$ true elements and the output variable $Y$ about the remaining $i$ variables $S^1$. Hence, the ratio represents the number of samples needed to control $i$ support errors and the maximization accounts for all possible support errors.
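The order-wise agreement noted in the remarks can be checked numerically by comparing the two numerators, since the denominators coincide. The sketch below uses hypothetical function names; the values of $N$, $K$ and $i$ are arbitrary and serve only to illustrate that both counts scale as $i \log N$ for fixed $K$.

```python
import math


def sufficiency_numerator(n, k, i):
    """log2 of the number of size-k sets differing from the true set in i elements."""
    return math.log2(math.comb(n - k, i) * math.comb(k, i))


def necessity_numerator(n, k, i):
    """log2 of the number of size-k sets containing K - i revealed true elements."""
    return math.log2(math.comb(n - k + i, i))


for n in (10**3, 10**6, 10**9):
    ratio = sufficiency_numerator(n, 3, 2) / necessity_numerator(n, 3, 2)
    print(n, ratio)
```

For fixed $K$ and $i$ the two numerators differ by an additive term that does not grow with $N$, so their ratio tends to one as $N$ grows.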

3.3 Continuous Case 

Even though the results and proof ideas that were used in Sections 3.1 and 3.2 are fairly general, the proofs provided in Section 3.1 were stated for discrete variables and outcomes. In this section we make the necessary generalizations to extend these proofs to continuous variable and observation models. We follow the methodology in [16] and [IS].

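To simplify the exposition, we consider the extension to continuous variables in the special case of fixed and known $\beta_S$. In that case, $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$ reduces to $I(X_{S^1}; X_{S^2}, Y)$ and $E_o(\rho)$ as defined in (11) reduces to

$$E_o(\rho) = -\log \sum_{Y} \sum_{X_{S^2}} \left[ \sum_{X_{S^1}} Q(X_{S^1})\, p(Y, X_{S^2} \mid X_{S^1})^{\frac{1}{1+\rho}} \right]^{1+\rho}, \quad 0 \le \rho \le 1 \tag{25}$$

with $\left.\frac{\partial E_o(\rho)}{\partial \rho}\right|_{\rho = 0} = I(X_{S^1}; X_{S^2}, Y)$, since $(X^{(t)}, Y^{(t)})$ pairs are independent across $t$ for fixed $\beta_S$.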



Assume the continuous joint variable probability density $Q(X)$ with joint cumulative distribution function $F$ and the conditional probability density $p(Y = y \mid X = x)$ for the observation model, which is assumed to be a continuous function of both $x$ and $y$.

Let $X' \in \mathcal{X}'^N$ be the random vector and $Y' \in \mathcal{Y}'$ be the random variable generated by the quantization of $X \in \mathcal{X}^N = \mathbb{R}^N$ and $Y \in \mathcal{Y} = \mathbb{R}$ respectively, where each variable in $X$ is quantized to $L$ values and $Y$ is quantized to $J$ values. Let $F'$ be the joint cumulative distribution function of $X'$. As before, let $\hat{S}(X^T, Y^T)$ be the ML decoder with continuous inputs, with the probability of making $i$ errors in decoding denoted by $P(E_i)$. Let $\hat{S}'(X'^T, Y'^T)$ be the ML decoder that quantizes inputs $X^T$ and $Y^T$ to $X'^T$ and $Y'^T$, and has the corresponding probability of error $P'(E_i)$. Define



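$$E_o(\rho, X', Y') = -\log \sum_{y' \in \mathcal{Y}'} \sum_{x'_{S^2} \in \mathcal{X}'^{K-i}} \left[ \sum_{x'_{S^1} \in \mathcal{X}'^{i}} Q(x'_{S^1})\, p(y', x'_{S^2} \mid x'_{S^1})^{\frac{1}{1+\rho}} \right]^{1+\rho},$$

$$E_o(\rho, X, Y) = -\log \int_{\mathcal{Y}} \int_{\mathcal{X}^{K-i}} \left[ \int_{\mathcal{X}^{i}} Q(x_{S^1})\, p(y, x_{S^2} \mid x_{S^1})^{\frac{1}{1+\rho}}\, dx_{S^1} \right]^{1+\rho} dx_{S^2}\, dy,$$

where the indexing denotes the random variates with respect to which the error exponents are computed.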
Utilizing the results in Section 3.1 for the discrete models, we will show the following for the continuous model

$$P(E_i) \le 2^{-T\left(E_o(\rho, X, Y) - \frac{\rho \log \binom{N-K}{i}\binom{K}{i}}{T}\right)}. \tag{26}$$

The rest of the proof will then follow as in the discrete case, by noting that $\left.\frac{\partial E_o(\rho, X, Y)}{\partial \rho}\right|_{\rho=0} = I(X_{S^1}; X_{S^2}, Y)$, with the mutual information definition for continuous variables [TDj .

Our strategy will be the following: we will increase the number of quantization levels for $Y'$ and $X'$ respectively, and since the discrete result (12) holds for any number of quantization levels, by taking limits we will be able to show that

$$P'(E_i) \le 2^{-T\left(E_o(\rho, X, Y) - \frac{\rho \log \binom{N-K}{i}\binom{K}{i}}{T}\right)}. \tag{27}$$

Since $\hat{S}(X^T, Y^T)$ is the minimum probability of error decoder, any upper bound for $P'(E_i)$ will also be an upper bound for $P(E_i)$, proving (26).

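Assume $Y$ is quantized with the quantization boundaries denoted by $a_1, \dots, a_{J-1}$, with $Y' = a_j$ if $a_{j-1} < Y \le a_j$. For convenience denote $a_0 = -\infty$ and $a_J = \infty$. Furthermore assume quantization boundaries are equally spaced, i.e. $a_j - a_{j-1} = \Delta_J$ for $2 \le j \le J - 1$. Now we can write the following

$$
\begin{aligned}
E_o(\rho, X', Y') &= -\log \sum_{j=1}^{J} \sum_{x'_{S^2}} \left[ \sum_{x'_{S^1}} Q(x'_{S^1}) \left( \int_{a_{j-1}}^{a_j} p(y, x'_{S^2} \mid x'_{S^1})\, dy \right)^{\frac{1}{1+\rho}} \right]^{1+\rho} && (28) \\
&= -\log \Bigg\{ \sum_{j=2}^{J-1} \Delta_J \sum_{x'_{S^2}} \left[ \sum_{x'_{S^1}} Q(x'_{S^1}) \left( \frac{1}{\Delta_J} \int_{a_{j-1}}^{a_j} p(y, x'_{S^2} \mid x'_{S^1})\, dy \right)^{\frac{1}{1+\rho}} \right]^{1+\rho} && (29) \\
&\qquad + \sum_{x'_{S^2}} \left[ \sum_{x'_{S^1}} Q(x'_{S^1}) \left( \int_{-\infty}^{a_1} p(y, x'_{S^2} \mid x'_{S^1})\, dy \right)^{\frac{1}{1+\rho}} \right]^{1+\rho} && (30) \\
&\qquad + \sum_{x'_{S^2}} \left[ \sum_{x'_{S^1}} Q(x'_{S^1}) \left( \int_{a_{J-1}}^{\infty} p(y, x'_{S^2} \mid x'_{S^1})\, dy \right)^{\frac{1}{1+\rho}} \right]^{1+\rho} \Bigg\}. && (31)
\end{aligned}
$$

Let $J \to \infty$ and for each $J$ choose the sequence of quantization boundaries such that $\lim \Delta_J = 0$, $\lim a_{J-1} = \infty$, $\lim a_1 = -\infty$. Then the last two terms disappear and, using the fundamental theorem of calculus, we obtain

$$\lim_{J \to \infty} E_o(\rho, X', Y') = E_o(\rho, X', Y) = -\log \int_{\mathcal{Y}} \sum_{x'_{S^2}} \left[ \sum_{x'_{S^1}} Q(x'_{S^1})\, p(y, x'_{S^2} \mid x'_{S^1})^{\frac{1}{1+\rho}} \right]^{1+\rho} dy. \tag{32}$$

It can also be shown that $E_o(\rho, X', Y')$ increases for finer quantizations of $Y'$; therefore $E_o(\rho, X', Y)$ gives the smallest upper bound on $P'(E_i)$ over the quantizations of $Y$.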
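The monotonicity claim for finer quantizations of $Y$ can be checked on a toy channel. The sketch below is an illustration under stated assumptions, not the general model of this section: a single binary variable $X$ uniform on $\{-1, +1\}$ (so $K = i = 1$ and there is no $S^2$), $Y = X + \mathcal{N}(0, 1)$, and nested quantizers with $J = 2, 4, 8, \dots$ cells. The function names and the range parameter `A` are hypothetical.

```python
import math


def gaussian_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def quantized_Eo(rho, J, A=4.0):
    """E_o(rho) in bits when Y = X + N(0, 1) is quantized to J cells.

    Inner boundaries are equally spaced on [-A, A]; the two outer cells
    extend to -infinity and +infinity, mirroring a_0 and a_J in the text.
    """
    edges = [-math.inf] + [-A + j * (2 * A / J) for j in range(1, J)] + [math.inf]
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        inner = 0.0
        for x in (-1.0, 1.0):
            # Probability that Y falls in the cell (lo, hi] given X = x.
            p_cell = gaussian_cdf(hi - x) - gaussian_cdf(lo - x)
            inner += 0.5 * p_cell ** (1.0 / (1.0 + rho))
        total += inner ** (1.0 + rho)
    return -math.log2(total)


# Doubling J keeps every old boundary, so each quantizer refines the previous one.
values = [quantized_Eo(0.5, J) for J in (2, 4, 8, 16, 32)]
```

Because each quantizer refines the previous one, the exponent should not decrease along this sequence, and at $\rho = 0$ it is zero for every $J$.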

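We repeat the same procedure for $X$. Assume each variable $X_n$ in $X$ is quantized with the quantization boundaries denoted by $b_1, \dots, b_{L-1}$, with $X'_n = b_l$ if $b_{l-1} < X_n \le b_l$. For convenience denote $b_0 = -\infty$ and $b_L = \infty$. Furthermore assume quantization boundaries are equally spaced, i.e. $b_l - b_{l-1} = \Delta_L$ for $2 \le l \le L - 1$. Then we can write

$$
\begin{aligned}
E_o(\rho, X', Y) &= -\log \int_{\mathcal{Y}} \sum_{l=1}^{L} \left[ \sum_{x'_{S^1}} Q(x'_{S^1}) \left( \int_{b_{l-1}}^{b_l} p(y, x_{S^2} \mid x'_{S^1})\, dx_{S^2} \right)^{\frac{1}{1+\rho}} \right]^{1+\rho} dy && (33) \\
&= -\log \int_{\mathcal{Y}} \sum_{l=1}^{L} \left[ \int_{\mathcal{X}'} \left( \int_{b_{l-1}}^{b_l} p(y, x_{S^2} \mid x_{S^1})\, dx_{S^2} \right)^{\frac{1}{1+\rho}} dF'(x_{S^1}) \right]^{1+\rho} dy && (34) \\
&= -\log \int_{\mathcal{Y}} \Bigg\{ \sum_{l=2}^{L-1} \Delta_L \left[ \int_{\mathcal{X}'} \left( \frac{1}{\Delta_L} \int_{b_{l-1}}^{b_l} p(y, x_{S^2} \mid x_{S^1})\, dx_{S^2} \right)^{\frac{1}{1+\rho}} dF'(x_{S^1}) \right]^{1+\rho} \\
&\qquad + \left[ \int_{\mathcal{X}'} \left( \int_{-\infty}^{b_1} p(y, x_{S^2} \mid x_{S^1})\, dx_{S^2} \right)^{\frac{1}{1+\rho}} dF'(x_{S^1}) \right]^{1+\rho} \\
&\qquad + \left[ \int_{\mathcal{X}'} \left( \int_{b_{L-1}}^{\infty} p(y, x_{S^2} \mid x_{S^1})\, dx_{S^2} \right)^{\frac{1}{1+\rho}} dF'(x_{S^1}) \right]^{1+\rho} \Bigg\}\, dy, && (35)
\end{aligned}
$$

where (34) follows with $F'(x_{S^1})$ being the step function which represents the cumulative distribution function of the quantized variables $X'_{S^1}$.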

Let $L \to \infty$ and for each $L$ choose a set of quantization points such that $\lim \Delta_L = 0$, $\lim b_{L-1} = \infty$, $\lim b_1 = -\infty$. Again the second and third terms disappear and the first sum converges to the integral over $X_{S^2}$. Note that $p(y, x_{S^2} \mid x_{S^1})$ is a continuous function of all its variables since it was assumed that $Q(x)$ and $p(y \mid x)$ were continuous. Also note that $\lim_{L \to \infty} F' = F$, which implies the weak convergence of the probability measure of $X'$ to the probability measure of $X$. Given these facts, using the portmanteau theorem we obtain that $E_{F'}[p(Y, X_{S^2} \mid X_{S^1})] \to E_F[p(Y, X_{S^2} \mid X_{S^1})]$, which leads to

$$\lim_{L \to \infty} E_o(\rho, X', Y) = -\log \int_{\mathcal{Y}} \int_{\mathcal{X}^{K-i}} \left[ \int_{\mathcal{X}^{i}} p(y, x_{S^2} \mid x_{S^1})^{\frac{1}{1+\rho}}\, dF(x_{S^1}) \right]^{1+\rho} dx_{S^2}\, dy = E_o(\rho, X, Y). \tag{36}$$

This leads to the following result, completing the proof.

$$P(E_i) \le P'(E_i) \le \lim_{J, L \to \infty} 2^{-T\left(E_o(\rho, X', Y') - \frac{\rho \log \binom{N-K}{i}\binom{K}{i}}{T}\right)} = 2^{-T\left(E_o(\rho, X, Y) - \frac{\rho \log \binom{N-K}{i}\binom{K}{i}}{T}\right)}. \tag{37}$$

3.4 High-dimensional Case

The results in Section 3.1 provided sufficient conditions for the decoding error to go to zero asymptotically in the case of fixed $K$, i.e. the given conditions ensured that the following holds true,

$$\lim_{K \to \infty} \lim_{N \to \infty} P(E) = 0. \tag{38}$$

In this section we consider the case where $K$ scales together with $N$, and carry out the analysis needed to find sufficient conditions for the following expression to hold true,

$$\lim_{\substack{N \to \infty \\ K = o(N)}} P(E) = 0. \tag{39}$$

Below, we state a sufficient condition on $T$ for a scaling regime of the mutual information expression $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$.

Theorem 3.3. If $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$ scales with $K$ and $N$ such that

$$\liminf_{K, N \to \infty} I(X_{S^1}; X_{S^2}, Y \mid \beta_S) > 0 \tag{40}$$

for all $i = 1, \dots, K$ and all partitions $(S^1, S^2)$ of $S$ with $|S^1| = i$, then

$$T = \Omega(K \log N) \tag{41}$$

samples are sufficient for identifying set $S$ with arbitrarily small error probability.
Proof. First, note the following,

$$P(E) \le \sum_{i=1}^{K} P(E_i) \le K \cdot \max_i P(E_i) = \max_i K \cdot P(E_i), \tag{42}$$

hence we have an extra $\log K$ term in the error exponent. This term did not exist in the main result since $K$ was fixed.

For notational convenience let $I = I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$ and $E = \max_{\rho \in [0,1]} |E_o''(\rho)|$; then writing the Taylor expansion of $E_o(\rho)$ and taking into consideration (42) and that $T = \Omega(K \log N)$, we have

$$T f(\rho) \ge T \left( \rho I - \frac{\rho^2}{2} E - \frac{\rho \log \binom{N-K}{i}\binom{K}{i}}{T} - \frac{\log K}{T} \right) \tag{43}$$

and our aim is to show that the above quantity approaches infinity for some $\rho \in [0, 1]$ as $K, N \to \infty$.

The condition of the theorem, $\liminf_{K, N \to \infty} I(X_{S^1}; X_{S^2}, Y \mid \beta_S) > 0$ for all $i = 1, \dots, K$, implies that there exists a $\gamma > 0$, independent of $K$ and $N$, such that

$$I(X_{S^1}; X_{S^2}, Y \mid \beta_S) \ge \gamma > 0 \tag{44}$$

for all $i$, for large enough $K$ and $N$.

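Then for $T = cK \log N$, we see that the term $\frac{\log K}{T}$ vanishes and the term $\frac{1}{T} \log \binom{N-K}{i}\binom{K}{i}$ is at most of order $\frac{1}{c}$, so that the last two terms in (43) are dominated by $\rho I$ for a large enough constant $c$. Therefore we obtain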



Tfip)>T(^pI-^E-p^^ (45) 

where by choosing p close enough to zero and leveraging the fact that E is bounded above by a constant, 
the second term can also be ignored, leaving us with 

Tf(p) >Tp(l- -\ (46) 

which tends to infinity for a large enough choice of constant $c > 0$, as $I \ge \gamma$. □

We present a corollary with a simplified condition that can be checked to determine if $T = \Omega(K \log N)$ samples are sufficient, which directly follows from the above theorem.

Corollary 3.1. The condition that

$$\lim I(X_S; Y \mid \beta_S) > 0$$

is necessary for $T = \Omega(K \log N)$ samples to be sufficient for identifying set $S$ with arbitrarily small error probability.
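The role of the constant $c$ in the proof of Theorem 3.3 can be illustrated with the counting term alone. The following is a minimal sketch, assuming $N \ge 2K$ and using the elementary bounds $\binom{N-K}{i} \le N^i$ and $\binom{K}{i} \le N^i$, so that $\log \binom{N-K}{i}\binom{K}{i} \le 2i \log N$; the function names are hypothetical.

```python
import math


def log2_error_sets(N, K, i):
    """log2 of C(N-K, i) * C(K, i), the number of supports with exactly i errors."""
    return math.log2(math.comb(N - K, i)) + math.log2(math.comb(K, i))


def worst_case_penalty(N, K, c):
    """max over i of log2[C(N-K, i) C(K, i)] / T for T = c * K * log2(N).

    Assumes N >= 2K so that every binomial coefficient is positive.
    """
    T = c * K * math.log2(N)
    return max(log2_error_sets(N, K, i) for i in range(1, K + 1)) / T


penalty_small_c = worst_case_penalty(1000, 20, 4.0)
penalty_large_c = worst_case_penalty(1000, 20, 40.0)
```

Under these assumptions the penalty is at most $2/c$, so it can be pushed below any fixed $\gamma > 0$ by increasing $c$, which is the mechanism the proof relies on.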

4 Applications 

In the sections below, we analyze and state specific results for some problems for which our necessity and 
sufficiency results are applicable. In the first section, we look at linear observation models and derive results 
for linear compressive sensing (CS) with measurement noise as a specific example, along with a multivariate 
regression model, where we deal with vector-valued variables and outcomes. In the second section, we analyze 
quantized CS and group testing (Boolean CS) as examples of non-linear observation models. Finally, we 
look at a general framework where some of the variables are not observed, i.e., each variable is missing with 
some probability. 

4.1 Linear Settings 

4.1.1 Compressive Sensing 

Using the bounds presented in this paper for general sparse models, we derive sufficient conditions for the linear compressive sensing (CS) problem with measurement noise [12] and Gaussian sensing matrix with i.i.d. entries.

We have the following normalized model [T],

$$Y^T = X^T \beta + W^T \tag{47}$$

where $X^T$ is the $T \times N$ sensing matrix, $\beta$ is a $K$-sparse vector of length $N$ with support $S$, $W^T$ is the measurement noise of length $T$ and $Y^T$ is the observation vector of length $T$. In particular, we assume $X_n$ are Gaussian distributed random variables and the entries of the matrix are independent across rows $t$ and columns $n$. Each element $X_n$ is zero mean and has variance $\frac{1}{T}$.

We assume each element of the observation noise $W^T$ is i.i.d. with $W \sim \mathcal{N}\left(0, \frac{1}{\mathrm{SNR}}\right)$. The coefficients of the support, $\beta_S$, are i.i.d. Gaussian random variables with zero mean and variance $\sigma^2$.
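The normalized model can be simulated directly. The sketch below assumes the reading of the model used in this section's lemma: sensing entries with variance $1/T$, noise variance $1/\mathrm{SNR}$, and support coefficients with variance $\sigma^2$. The function name and its parameters are hypothetical.

```python
import math
import random


def generate_cs_instance(T, N, K, snr, sigma2, seed=0):
    """Draw one instance of Y = X beta + W with a K-sparse beta."""
    rng = random.Random(seed)
    support = sorted(rng.sample(range(N), K))
    beta = [0.0] * N
    for n in support:
        beta[n] = rng.gauss(0.0, math.sqrt(sigma2))
    # T x N sensing matrix with i.i.d. N(0, 1/T) entries.
    X = [[rng.gauss(0.0, math.sqrt(1.0 / T)) for _ in range(N)] for _ in range(T)]
    # i.i.d. N(0, 1/SNR) measurement noise.
    W = [rng.gauss(0.0, math.sqrt(1.0 / snr)) for _ in range(T)]
    # Each measurement depends only on the support columns of its row.
    Y = [sum(X[t][n] * beta[n] for n in support) + W[t] for t in range(T)]
    return X, beta, support, Y


X, beta, support, Y = generate_cs_instance(T=40, N=100, K=5, snr=20.0, sigma2=1.0)
```

Each row of `X` plays the role of one realization of the variables $X_1, \dots, X_N$, and only the support columns enter the corresponding measurement.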

In order to analyze the CS problem using the proposed sparse signal processing framework, it is important to observe how the CS model as defined above relates to the general sparse model. In the case of CS, the elements in a row of the sensing matrix correspond to variables $X_1, \dots, X_N$ as defined in Section 2. Each row of the sensing matrix is a realization of $X$ and rows are generated i.i.d. to form $X^T$. It is easy to see that assumption (1) is satisfied in both models, since each measurement $Y^{(t)}$ depends only on the linear combination of the elements $X_S$ that correspond to the support of $\beta$. The coefficients of this combination are given by $\beta_S$, the values of the non-zero elements of $\beta$. $\beta_S$ corresponds to the latent parameter of the observation model $P(Y \mid X_S, \beta_S)$, which encapsulates the noise $W$.

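For the following results, let $\alpha = \frac{i}{K}$ denote the support distortion, the ratio of misidentified elements of the support $S$. Note that $\frac{1}{K} \le \alpha \le 1$. We state the following lemma.

Lemma 4.1.

$$I(X_{S^1}; X_{S^2}, Y \mid \beta_S) = E\left[ \frac{1}{2} \ln\left( 1 + \frac{\beta_{S^1}^\top \beta_{S^1}\, \mathrm{SNR}}{T} \right) \right]$$

where the expectation is with respect to $\beta_{S^1}$, which are the coefficients in $\beta_S$ corresponding to the indices in $S^1$.

Proof. We write the mutual information term in the following chain of equalities to obtain the lemma.

$$
\begin{aligned}
I(X_{S^1}; X_{S^2}, Y \mid \beta_S) &= h(Y, X_{S^2} \mid \beta_S) - h(Y, X_{S^2} \mid X_{S^1}, \beta_S) \\
&= h(Y \mid X_{S^2}, \beta_S) - h(Y \mid X_S, \beta_S) \\
&= h(X_{S^1}^\top \beta_{S^1} + W \mid \beta_{S^1}) - h(W) \\
&= E\left[ \frac{1}{2} \ln\left( 2\pi e \left( \frac{\beta_{S^1}^\top \beta_{S^1}}{T} + \frac{1}{\mathrm{SNR}} \right) \right) \right] - \frac{1}{2} \ln\left( \frac{2\pi e}{\mathrm{SNR}} \right) \\
&= E\left[ \frac{1}{2} \ln\left( 1 + \frac{\beta_{S^1}^\top \beta_{S^1}\, \mathrm{SNR}}{T} \right) \right]. \qquad \square
\end{aligned}
$$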

A closer analysis reveals that we can effectively take the expectation inside the logarithm and replace $\beta_{S^1}^\top \beta_{S^1}$ with its expectation, $\alpha K \sigma^2$. Considering all values of $\alpha$ for exact recovery and noting that the numerator $\log \binom{N-K}{i}\binom{K}{i} = \Theta(\alpha K \log N)$, we then state the following theorem.

Theorem 4.1. For compressive sensing with independent Gaussian sensing columns with $\mathrm{SNR} = \Omega(\log N)$ (which is a necessary condition for recovery JJ^), $T = \Omega\left( \frac{K \log N}{\sigma^2} \right)$ measurements are sufficient to recover $S$, the support of $\beta$, with an arbitrarily small average error probability.

The proof is provided in the Appendix.

Remark 4.1. For the linear CS problem, we showed that our relatively simple analysis gives us a bound asymptotically identical to the best-known bound $T = \Omega(K \log(N/K))$ IJj with an independent Gaussian sensing matrix, in the sublinear sparsity regime. Although we provided results for Gaussian distributed $\beta_S$, it is easy to obtain results for other cases such as fixed or lower bounded coefficients.



4.1.2 Multivariate Regression 

In this problem, we consider the following linear model, where we have a total of $R$ linear regression problems,

$$Y_{\{r\}} = X_{\{r\}} \beta_{\{r\}} + W_{\{r\}}, \quad r = 1, \dots, R,$$

where for each $r$, $\beta_{\{r\}} \in \mathbb{R}^N$ is a $K$-sparse vector, $X_{\{r\}} \in \mathbb{R}^{T \times N}$ and $Y_{\{r\}} \in \mathbb{R}^T$. The relation between different tasks $r = 1, \dots, R$ is that all $\beta_{\{r\}}$ share the same support $S$. This model is also called multiple linear regression or distributed compressive sensing [28] and is useful in applications such as multi-task learning [TU] .

It is easy to see that this problem can be formulated in our sparse recovery framework, with vector-valued outcomes $Y$ and variables $X$. Namely, let $Y = (Y_{\{1\}}, \dots, Y_{\{R\}})^\top \in \mathbb{R}^R$ be a vector-valued outcome, $X = (X_{\{1\}}, \dots, X_{\{R\}})^\top \in \mathbb{R}^{R \times N}$ be the collection of $N$ vector-valued variables and $\beta = (\beta_{\{1\}}, \dots, \beta_{\{R\}}) \in \mathbb{R}^{N \times R}$ be the collection of $R$ sparse vectors sharing support $S$, making it block-sparse. This mapping is illustrated in Figure 5. Assuming independence between $X_{\{r\}}$ and support coefficients $\beta_{\{r\},S}$ across $r = 1, \dots, R$, we have the following observation model:

$$P(Y \mid X) = P(Y \mid X_S) = \prod_{r=1}^{R} P(Y_{\{r\}} \mid X_{\{r\},S}) = \prod_{r=1}^{R} \int P(Y_{\{r\}} \mid X_{\{r\},S}, \beta_{\{r\},S})\, P(\beta_{\{r\},S})\, d\beta_{\{r\},S}.$$

We present the following theorem as a straightforward extension of Theorem 3.1.

Theorem 4.2. For the multiple regression model with $R$ regression problems (with finite $R$), with independent matrices $X_{\{r\}}$ and support coefficients $\beta_{\{r\},S}$ for different $r$, the following is a sufficient condition to identify the joint support set $S$,

$$T > (1 + \epsilon) \cdot \max_{i = 1, \dots, K} \frac{\log \binom{N-K}{i}\binom{K}{i}}{\sum_{r=1}^{R} I(X_{\{r\},S^1}; X_{\{r\},S^2}, Y_{\{r\}} \mid \beta_{\{r\},S})} \tag{48}$$

with an arbitrarily small average error probability.

Figure 5: Mapping the multiple linear regression problem to a vector-valued outcome and variable model. On the left is the representation for a single problem $r = 1$. On the right is the corresponding vector formulation, shown for sample index $t = 2$.



The proof follows directly from the decomposition of $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$, due to the independence of $X$, $Y$ and $\beta_S$ across different $r$.

We present the following theorem for the specific linear model presented in Section 4.1.1 as a direct result of Theorem 4.1 and Theorem 4.2.

Theorem 4.3. For the multiple regression model with $R$ problems, each given by the CS framework in Section 4.1.1 with i.i.d. $X_{\{r\}}$ and $\beta_{\{r\},S}$ across $R$ problems, $T = \Omega\left( \frac{K \log N}{R \sigma^2} \right)$ measurements are sufficient to recover the joint support $S$ with an arbitrarily small average error probability, with $\mathrm{SNR} = \Omega(\log N)$.

Remark 4.2. We showed that having $R$ problems with independent measurements and sparse vector coefficients decreases the number of measurements per problem by a factor of $1/R$. While having $R$ such problems increases the number of measurements $R$-fold, the inherent uncertainty in the problem is the same since the support is shared. It is then reasonable to expect such a decrease in measurements.

4.2 Non-linear Settings 

4.2.1 1-bit Quantized Compressive Sensing 

As an example of a non-linear observation model, we look at the 1-bit compressive sensing problem [5], [18], [17]. We follow the problem setup of [17]. For the 1-bit CS model, we have

$$Y^T = q(X^T \beta + W^T) \tag{49}$$

where $X^T$ is a $T \times N$ matrix with i.i.d. standard Gaussian elements, $\beta$ is an $N \times 1$ vector that is $K$-sparse with support $S$ and $\beta_S = 1$. $W^T$ is a $T \times 1$ noise vector with standard Gaussian elements. $q(\cdot)$ is a 1-bit quantizer which outputs 1 if the input is non-negative and 0 otherwise, for each element in the input vector. This setup corresponds to the $\mathrm{SNR} = 1$ regime in [17].
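The 1-bit observation model can be simulated in a few lines. This is a minimal sketch under the setup stated above (standard Gaussian sensing entries and noise, support coefficients equal to 1, quantizer output 1 for non-negative inputs and 0 otherwise); the function name and arguments are hypothetical.

```python
import random


def one_bit_measurements(T, N, support, seed=0):
    """Return (X, Y) with Y[t] = 1 if sum of support entries of row t plus noise >= 0."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(T):
        row = [rng.gauss(0.0, 1.0) for _ in range(N)]
        noise = rng.gauss(0.0, 1.0)
        # With beta_S = 1, the quantizer input is the sum over the support plus noise.
        Y.append(1 if sum(row[n] for n in support) + noise >= 0 else 0)
        X.append(row)
    return X, Y


X, Y = one_bit_measurements(T=2000, N=30, support=[1, 5, 9])
```

Since the quantizer input is symmetric around zero, roughly half of the outcomes are 1, so a single outcome carries at most one bit about the support.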

To simplify the analysis and exposition, we analyze the degenerate case of $\beta \in \{0, 1\}^N$, i.e. the latent variable $\beta_S$ is known and equal to the vector of 1's. However, the general case where $\beta_S$ are random with a known distribution can also be analyzed using the condition given by Theorem 3.1. In order to obtain the model-specific bounds, we analyze the mutual information term $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$, which reduces to $I(X_{S^1}; X_{S^2}, Y)$ for this case.

Theorem 4.4. For 1-bit CS with i.i.d. Gaussian sensing matrix and the above setup, $T = \Omega(K \log N)$ measurements are sufficient to recover $S$, the support of $\beta$, with an arbitrarily small average error probability.

The proof is provided in the Appendix.

Remark 4.3. Similar to linear CS, for 1-bit CS with noise we provided a sufficiency bound that matches [17] for an i.i.d. Gaussian sensing matrix, for the corresponding SNR regime.

4.2.2 Group Testing - Boolean Model 

In this section we look at another non-linear model, group testing. The fundamental problem of group testing can be summarized as follows. Among a population of $N$ items, $K$ unknown items are of interest. The collection of these $K$ items represents the defective set. The goal is to construct a pooling design, i.e., a collection of tests, to recover the defective set while reducing the number of required tests. In this case $X^T$ is a binary measurement matrix defining the assignment of items to tests. For the noise-free case, the outcome of the tests $Y^T$ is deterministic. It is the Boolean sum of the codewords corresponding to the defective set $S$. In other words

$$Y^T = \bigvee_{i \in S} X_i^T. \tag{50}$$

Alternatively, if $R_i \in \{0, 1\}$ is an indicator function for the $i$-th item determining whether it belongs to the defective set (i.e. $R_i = 1$ if $i \in S$ and $R_i = 0$ otherwise), the outcome $Y^{(t)}$ of the $t$-th test in the noise-free case can be written as

$$Y^{(t)} = \bigvee_{i=1}^{N} X_i^{(t)} R_i, \tag{51}$$

where $X_i^{(t)}$ is the $t$-th entry of the vector $X_i^T$, or equivalently, the binary entry at cell $(i, t)$ of the measurement matrix $X^T$.
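The two equivalent descriptions of the noise-free test outcomes, (50) and (51), can be written out as follows. This is a small sketch in which the measurement matrix is stored as a list of tests (rows) over items (columns); the function names are hypothetical.

```python
def outcomes_from_defective_set(X, defective):
    """Boolean OR, per test, over the columns of the defective items, as in (50)."""
    return [int(any(row[i] for i in defective)) for row in X]


def outcomes_from_indicators(X, R):
    """Boolean OR, per test, of X_i^(t) AND R_i over all items, as in (51)."""
    return [int(any(x and r for x, r in zip(row, R))) for row in X]


# Four tests over three items; items 0 and 2 are defective.
X = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
y_set = outcomes_from_defective_set(X, defective={0, 2})
y_ind = outcomes_from_indicators(X, R=[1, 0, 1])
```

A test is positive exactly when it pools at least one defective item, so both functions return the same outcome vector.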

Theorem 4.5. For $N$ items and $K$ defectives, the number of tests $T = \Omega(K \log N)$ is sufficient to identify the defective set $S$ with an arbitrarily small average error probability. In other words, there is a constant $c$ independent of $N$ and $K$ such that if $T = cK \log N$ then the average probability of error goes to zero.

Our result also establishes upper and lower bounds on the number of tests needed for noisy versions of 
group testing, as well as worst-case errors. In particular, we consider testing with additive noise (leading to 
false alarms) and testing with dilution effects (leading to potential misses). We refer the reader to [1] for 
further details. 

4.3 Models with Missing Features 

Consider the general sparse signal processing model with independent variables, as considered in Section 3.1. However, instead of trying to infer $S$ given the features $X^T$ and outputs $Y^T$, assume we observe a $T \times N$ matrix $Z^T$ instead of $X^T$, with the relation

$$Z_i^{(t)} = \begin{cases} X_i^{(t)}, & \text{w.p. } 1 - p \\ 0, & \text{w.p. } p \end{cases} \qquad \forall i \in \{1, \dots, N\},\ t \in \{1, \dots, T\},$$

i.e., we observe a version of the feature matrix which may have entries missing with probability $p$, independently for each entry. We show how the sample complexity changes relative to the case where features are fully observed.

Theorem 4.6. Assume we observe $T_{\text{miss}}$ i.i.d. samples of missing variables $Z$ and outcomes $Y$ in the missing data setup. Then,

$$T_{\text{miss}} > \frac{T_0}{1 - p}$$

is a sufficient condition for arbitrarily small average error probability in estimating $S$, where $T_0$ is the sufficiency bound given by Theorem 3.1 for fully observed variables.
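The missing-data observation model and the resulting sample inflation can be sketched as follows, assuming the sufficient condition takes the form $T_{\text{miss}} > T_0 / (1 - p)$. The function names are hypothetical.

```python
import math
import random


def mask_features(X, p, seed=0):
    """Replace each entry of X by 0 independently with probability p."""
    rng = random.Random(seed)
    return [[0.0 if rng.random() < p else x for x in row] for row in X]


def samples_needed_with_missing(T0, p):
    """Smallest integer at least T0 / (1 - p), for a missing probability p < 1."""
    return math.ceil(T0 / (1.0 - p))


X = [[0.3, -1.2, 0.7], [1.1, 0.4, -0.5]]
Z_none_missing = mask_features(X, p=0.0)
Z_all_missing = mask_features(X, p=1.0)
T_half_missing = samples_needed_with_missing(T0=100, p=0.5)
```

With half of the entries missing, twice as many samples are requested, which matches the intuition that each sample carries a fraction $1 - p$ of the information.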

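Proof. We derive $I(Z_{S^1}; Z_{S^2}, Y \mid \beta_S)$ in terms of $I(X_{S^1}; X_{S^2}, Y \mid \beta_S)$. To do that, we compute $H(Y \mid Z_S, \beta_S)$ for any set $S$. Define $W_S$ as a binary vector with Bernoulli random variables such that $X_S \cdot W_S = Z_S$ with element-wise (scalar) multiplication. To simplify the expressions, we omit the conditioning on $\beta_S$ in all entropy expressions below.

$$
\begin{aligned}
H(Y \mid Z_S) &= H(Y, Z_S) - H(Z_S) && (52) \\
&= H(Y, Z_S, X_S, W_S) - H(X_S, W_S \mid Y, Z_S) - \big( H(Z_S, X_S, W_S) - H(X_S, W_S \mid Z_S) \big) && (53) \\
&= H(Y \mid Z_S, X_S, W_S) - \big( H(W_S \mid Z_S) + H(X_S \mid W_S, Z_S, Y) \big) + \big( H(W_S \mid Z_S) + H(X_S \mid W_S, Z_S) \big) && (54) \\
&= H(Y \mid X_S) + H(X_S \mid W_S, Z_S) - H(X_S \mid W_S, Z_S, Y) && (55) \\
&= H(Y \mid X_S) + \sum_{i \in S} H(X_i \mid W_i, Z_i) - H(X_i \mid W_i, Z_i, Y) && (56) \\
&= H(Y \mid X_S) + \sum_{i \in S} \big( p H(X_i \mid W_i = 0, Z_i) + (1 - p) H(X_i \mid W_i = 1, Z_i) \big) \\
&\qquad - \big( p H(X_i \mid W_i = 0, Z_i, Y) + (1 - p) H(X_i \mid W_i = 1, Z_i, Y) \big) && (57) \\
&= H(Y \mid X_S) + \sum_{i \in S} p H(X_i) - p H(X_i \mid Y) && (58) \\
&= H(Y \mid X_S) + p H(X_S) - p H(X_S \mid Y) && (59) \\
&= (1 - p) H(Y \mid X_S) + p H(Y). && (60)
\end{aligned}
$$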



(52), (53) and (54) follow from the chain rule of entropy. (55) follows from the conditional independence of $Y$ and $Z_S, W_S$ given $X_S$. (56) follows from the independence of $X_i$, $Z_i$ and $W_i$ over $i \in S$. In (57) we explicitly write the conditional entropies for the two values of $W_i$. These expressions simplify to (58) since $X_i = Z_i$ if $W_i = 1$ and $Z_i$ gives no information on $X_i$ if $W_i = 0$. We group the terms over $i \in S$ to obtain (59) and again use the chain rule of entropy to obtain the final expression.

Then it simply follows that

$$
\begin{aligned}
I(Z_{S^1}; Z_{S^2}, Y \mid \beta_S) &= I(Z_{S^1}; Y \mid Z_{S^2}, \beta_S) = H(Y \mid Z_{S^2}, \beta_S) - H(Y \mid Z_S, \beta_S) \\
&= (1 - p) H(Y \mid X_{S^2}, \beta_S) + p H(Y \mid \beta_S) - (1 - p) H(Y \mid X_S, \beta_S) - p H(Y \mid \beta_S) \\
&= (1 - p)\, I(X_{S^1}; X_{S^2}, Y \mid \beta_S).
\end{aligned}
$$

□
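The scaling by $1 - p$ can be verified exactly on a small discrete example. The sketch below makes several simplifying assumptions that are not part of the general model: a single binary variable ($K = 1$, so $S^2$ is empty), a binary symmetric observation channel, and missing entries that are distinguishable from genuine values (marked by an explicit symbol here, which is automatic for continuous variables where the value 0 has probability zero). All names are hypothetical.

```python
import math


def mutual_information(joint):
    """I(A;B) in bits for a joint pmf given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0.0
    )


def joint_xy(q, eps):
    """X ~ Bernoulli(q); Y equals X flipped with probability eps."""
    joint = {}
    for x, px in ((0, 1.0 - q), (1, q)):
        for y in (0, 1):
            joint[(x, y)] = px * (eps if y != x else 1.0 - eps)
    return joint


def joint_zy(q, eps, p_miss):
    """Z equals X with probability 1 - p_miss and is marked missing otherwise."""
    joint = {}
    for (x, y), p in joint_xy(q, eps).items():
        joint[(x, y)] = joint.get((x, y), 0.0) + (1.0 - p_miss) * p
        joint[("missing", y)] = joint.get(("missing", y), 0.0) + p_miss * p
    return joint


i_x = mutual_information(joint_xy(0.3, 0.1))
i_z = mutual_information(joint_zy(0.3, 0.1, 0.25))
```

In this toy case `i_z` equals `0.75 * i_x` up to floating-point error, in line with the factor $1 - p$ derived above.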

As a special case, we obtain the following result for compressive sensing models with missing data [20]:

Theorem 4.7. For the linear compressive sensing setting in Section 4.1.1 with $\mathrm{SNR} = \Omega(\log N)$ and measurement matrix having missing entries with probability $p$, $T = \Omega\left( \frac{K \log N}{(1 - p)\sigma^2} \right)$ samples are sufficient to estimate $S$, the support of $\beta$.

Remark 4.4. We observe that the number of sufficient samples increases by a factor of $\frac{1}{1 - p}$ for missing probability $p$. This example highlights the flexibility of our results due to the mutual information characterization of the model; it is easy to compute new bounds for variations of any model due to this flexibility and obtain results for very general models such as this one.

5 Conclusions

In this paper, we stated the results of our information-theoretic analysis of the sample complexity for salient 
variable identification in sparse signal processing models. We characterized sufficient and necessary conditions 
on the number of samples, for the case of i.i.d. variables. 

The results we obtain for necessary and sufficient number of samples are fairly general and applicable to a 
wide range of problems. The bounds only require the computation of simple mutual information expressions 
and characterize the trade-offs between parameters of the observation model such as SNR, number of variables 
N and the number of salient variables K. We provided examples of signal processing problems where such a framework applies and derived results for specific cases in Section 4.

6 Appendix

Proof of Lemma 3.1

For expositional clarity, we show the following weaker bound:

$$P(E_i) \le 2^{-T\left(E_o(\rho) - \frac{\log \binom{N-K}{i}\binom{K}{i}}{T}\right)}. \tag{A.1}$$

Note that the main difference between the above equation and Lemma 3.1 is the missing $\rho$ term multiplying the binomial expression. The main result follows along the same lines and we refer the reader to [3] for further details.

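To prove this weaker result we denote by $A_i$ the set of indices corresponding to sets of $K$ variables that differ from the true set $S_1$ in exactly $i$ variables, i.e.,

$$A_i = \{ \omega \in \mathcal{I} : |S_{1^c,\omega}| = i,\ |S_\omega| = K \}. \tag{A.2}$$

We can establish that,

$$\Pr[E_i \mid \omega_0 = 1, X_{S_1}^T, Y^T] \le \sum_{S_{1,\omega}} \sum_{S_{1^c,\omega}} \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}}) \frac{p_\omega(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s}{p_1(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s}. \tag{A.3}$$

Inequality (A.3) is established separately in the Appendix. It follows that,

$$
\begin{aligned}
\Pr[E_i \mid \omega_0 = 1, X_{S_1}^T, Y^T] &\le \left( \sum_{S_{1,\omega}} \sum_{S_{1^c,\omega}} \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}}) \frac{p_\omega(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s}{p_1(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s} \right)^{\rho} && (\text{A.4}) \\
&\le \left( \binom{N-K}{i} \sum_{S_{1,\omega}} \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}}) \frac{p_\omega(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s}{p_1(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s} \right)^{\rho} && (\text{A.5}) \\
&\le \binom{N-K}{i}^{\rho} \sum_{S_{1,\omega}} \left( \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}}) \frac{p_\omega(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s}{p_1(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s} \right)^{\rho} && (\text{A.6})
\end{aligned}
$$

$\forall s > 0$, $0 \le \rho \le 1$.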

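Inequality (A.4) follows from the fact that $\Pr[E_i \mid \omega_0 = 1, X_{S_1}^T, Y^T] \le 1$. Consequently, if $U$ is an upper bound of this probability then it follows that $\Pr[E_i \mid \omega_0 = 1, X_{S_1}^T, Y^T] \le U^\rho$ for $\rho \in [0, 1]$. Inequality (A.5) follows from symmetry, namely, the inner summation is only dependent on the values of $X_{S_{1,\omega}}$ and not on the items in the set $S_{1^c,\omega}$. There are exactly $\binom{N-K}{i}$ possible sets $S_{1^c,\omega}$, hence the binomial expression. Note that the sum over $S_{1,\omega}$ cannot be further simplified. This is due to the fact that $X_{S_1}$ is already specified since we have conditioned on $X_{S_1}$. Since $X_{S_1}$ is fixed, the inner sum need not be equal for all sets $S_{1,\omega}$, $\omega \in A_i$. Finally, (A.6) follows from the standard observation that a sum of positive numbers raised to the $\rho$-th power for $\rho \le 1$ is smaller than the sum of the $\rho$-th power of each number.

We now substitute for the conditional error probability derived above and follow the steps below:

$$P(E_i) = \sum_{Y^T} \sum_{X^T_{S_1}} p_1(X^T_{S_1}, Y^T)\, \Pr[E_i \mid \omega_0 = 1, X^T_{S_1}, Y^T].$$

Due to symmetry the summation over sets $S_{1,\omega}$ does not depend on $\omega$. Since there are $\binom{K}{i}$ sets $S_{1,\omega}$ we get, with $s = \frac{1}{1+\rho}$,

$$
\begin{aligned}
P(E_i) &\le \binom{N-K}{i}\binom{K}{i} \sum_{Y^T} \sum_{X^T_{S_{1,\omega}}} \sum_{X^T_{S_{1,\omega^c}}} Q(X^T_{S_{1,\omega^c}})\, p_1(X^T_{S_{1,\omega}}, Y^T \mid X^T_{S_{1,\omega^c}}) \left( \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}}) \frac{p_\omega(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s}{p_1(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s} \right)^{\rho} \\
&= \binom{N-K}{i}\binom{K}{i} \sum_{Y^T} \sum_{X^T_{S_{1,\omega}}} \left[ \sum_{X^T_{S_{1,\omega^c}}} Q(X^T_{S_{1,\omega^c}})\, p_1(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^{\frac{1}{1+\rho}} \right] \left[ \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}})\, p_\omega(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^{\frac{1}{1+\rho}} \right]^{\rho} \\
&= \binom{N-K}{i}\binom{K}{i} \sum_{Y^T} \sum_{X^T_{S_{1,\omega}}} \left[ \sum_{X^T_{S_{1,\omega^c}}} Q(X^T_{S_{1,\omega^c}})\, p_1(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^{\frac{1}{1+\rho}} \right]^{1+\rho},
\end{aligned}
$$

where the last step follows by noting that from symmetry $X_{S_{1^c,\omega}}$ is just a dummy variable and can be replaced by $X_{S_{1,\omega^c}}$. This establishes the weaker bound in (A.1). Further details about the proof of Lemma 3.1 can be found in [i].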
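Step (A.6) in the proof above relies on the elementary inequality $\left(\sum_k a_k\right)^\rho \le \sum_k a_k^\rho$ for non-negative $a_k$ and $0 \le \rho \le 1$. A quick numerical check, with hypothetical names and randomly drawn values:

```python
import random


def subadditivity_gap(values, rho):
    """Return sum(a_k ** rho) - (sum(a_k)) ** rho, which should be non-negative."""
    return sum(v ** rho for v in values) - sum(values) ** rho


rng = random.Random(0)
gaps = [
    subadditivity_gap([rng.uniform(0.0, 5.0) for _ in range(6)], rho)
    for rho in (0.0, 0.25, 0.5, 0.75, 1.0)
]
```

The gap is zero at $\rho = 1$ and grows as $\rho$ decreases, which is why the bound is loosest for small $\rho$.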



Proof of Equation A.3

Let $\zeta_\omega$, $\omega \in A_i$, denote the event where $\omega$ is more likely than 1. Then, from the definition of $A_i$, the 2 encoded messages differ in $i$ variables. Hence

$$\Pr[E_i \mid \omega_0 = 1, X^T_{S_1}, Y^T] \le P\left( \bigcup_{\omega \in A_i} \zeta_\omega \right) \le \sum_{\omega \in A_i} P(\zeta_\omega).$$

Now note that $X_{S_\omega}$ shares $(K - i)$ variables with $X_{S_1}$. Following the introduced notation, the common partition is denoted $X_{S_{1,\omega}}$, which is a $T \times (K - i)$ submatrix. The remaining $i$ columns which are in $X_{S_1}$ but not in $X_{S_\omega}$ are $X_{S_{1,\omega^c}}$. Similarly, $X_{S_{1^c,\omega}}$ corresponds to variables in $X_{S_\omega}$ but not in $X_{S_1}$. In other words $X_{S_1} = (X_{S_{1,\omega}}, X_{S_{1,\omega^c}})$ and $X_{S_\omega} = (X_{S_{1,\omega}}, X_{S_{1^c,\omega}})$, where the notation $(F^{T \times n_1}, G^{T \times n_2})$ denotes a $T \times (n_1 + n_2)$ matrix with a submatrix $F$ in the first $n_1$ columns and $G$ in the remaining $n_2$ columns. Thus,

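$$P(\zeta_\omega) = \sum_{X^T_{S_\omega} :\, p(Y^T \mid X^T_{S_\omega}) \ge p(Y^T \mid X^T_{S_1})} Q(X^T_{S_\omega} \mid X^T_{S_1}) \le \sum_{X^T_{S_\omega}} Q(X^T_{S_\omega} \mid X^T_{S_1}) \frac{p(Y^T \mid X^T_{S_\omega})^s}{p(Y^T \mid X^T_{S_1})^s} \qquad \forall s > 0,\ \forall \omega \in A_i. \tag{A.7}$$

By independence $Q(X^T_{S_1}) = Q(X^T_{S_{1,\omega}})\, Q(X^T_{S_{1,\omega^c}})$. Similarly, $Q(X^T_{S_\omega}) = Q(X^T_{S_{1,\omega}})\, Q(X^T_{S_{1^c,\omega}})$. Since we are conditioning on a particular $X^T_{S_1}$, the partition $X^T_{S_{1,\omega}}$ is fixed in the summation in (A.7) and

$$
\begin{aligned}
P(\zeta_\omega) &\le \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}}) \frac{p(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s\, Q(X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s}{Q(X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s\, p(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s} \\
&= \sum_{X^T_{S_{1^c,\omega}}} Q(X^T_{S_{1^c,\omega}}) \frac{p(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}})^s}{p(Y^T, X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})^s}, && (\text{A.8})
\end{aligned}
$$

where the second step follows from the independence across variables, i.e. $Q(X^T_{S_{1,\omega}} \mid X^T_{S_{1^c,\omega}}) = Q(X^T_{S_{1,\omega}} \mid X^T_{S_{1,\omega^c}})$.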



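Proof of Theorem 4.1

Assume $T = \Theta(K \log N)$ and $\mathrm{SNR} = \Omega(\log N)$. Then we have $\frac{\mathrm{SNR}}{T} = \Omega\left( \frac{1}{K} \right)$ and therefore from Lemma 4.1

$$I(X_{S^1}; X_{S^2}, Y) = \Omega\left( E\left[ \log\left( 1 + \frac{\beta_{S^1}^\top \beta_{S^1}}{K} \right) \right] \right).$$

We will now show that $E\left[ \log\left( 1 + \frac{\beta_{S^1}^\top \beta_{S^1}}{K} \right) \right] = \Theta(\alpha \sigma^2)$. Define the following sequence of random variables

$$A_K = \frac{\log\left( 1 + \frac{\beta_{S^1}^\top \beta_{S^1}}{K} \right)}{\alpha \sigma^2};$$

then we will have proven our claim if we can show that $\lim_{K \to \infty} E[A_K] = c$ for some constant $c > 0$. To show that, first note that $A_K \ge 0$, $A_K \ge A_{K+1}$ and therefore $A_K \le A_1$ for all $K = 1, 2, \dots$. Since

$$E[A_1] = E\left[ \frac{\log(1 + \beta_{S^1}^\top \beta_{S^1})}{\alpha \sigma^2} \right] \le \frac{\log(1 + E[\beta_{S^1}^\top \beta_{S^1}])}{\alpha \sigma^2} = \frac{\log(1 + \alpha \sigma^2)}{\alpha \sigma^2} < \infty \tag{A.9}$$

due to Jensen's inequality, we have shown that all $A_K$ are dominated by $A_1$ and that $A_1$ has finite expectation. Therefore by the dominated convergence theorem, we have

$$\lim_{K \to \infty} E[A_K] = E\left[ \lim_{K \to \infty} A_K \right].$$

But for large $K$, $\log\left( 1 + \frac{\beta_{S^1}^\top \beta_{S^1}}{K} \right) \approx c\, \frac{\beta_{S^1}^\top \beta_{S^1}}{K}$ for a constant $c > 0$, which can be easily seen from the Taylor expansion of the logarithm. Again, using the dominated convergence theorem, we then have

$$\lim_{K \to \infty} E[A_K] = \lim_{K \to \infty} E\left[ c\, \frac{\beta_{S^1}^\top \beta_{S^1}}{K \alpha \sigma^2} \right] = \lim_{K \to \infty} c\, \frac{K \alpha \sigma^2}{K \alpha \sigma^2} = c,$$

showing $I(X_{S^1}; X_{S^2}, Y) = \Theta(\alpha \sigma^2)$.

Then, since $\log \binom{N-K}{i}\binom{K}{i} = \Theta(\alpha K \log N)$, we can write

$$\frac{\log \binom{N-K}{i}\binom{K}{i}}{I(X_{S^1}; X_{S^2}, Y)} = \Theta\left( \frac{\alpha K \log N}{\alpha \sigma^2} \right) = \Theta\left( \frac{K \log N}{\sigma^2} \right),$$

which is satisfied by $T = \Theta\left( \frac{K \log N}{\sigma^2} \right)$, proving Theorem 4.1.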
Proof of Theorem 4.4

We write the mutual information term as

$$I(X_{S^1}; X_{S^2}, Y) = H(Y \mid X_{S^2}) - H(Y \mid X_S)$$

where we will analyze $H(Y \mid X_{S^2})$ and $H(Y \mid X_S)$ to obtain a lower bound for the mutual information expression.

Defining $Z_1 = \sum_{j \in S^1} X_j$, $Z_2 = \sum_{j \in S^2} X_j$ and $Z = Z_1 + Z_2$, we have $H(Y \mid X_{S^2}) = H(Y \mid Z_2)$ since the quantizer input $X\beta + W$ depends only on the sum of the elements of $X_S$. Note that $Z_2 \sim \mathcal{N}(0, D^2)$ with $D^2 = K - i$. Now we explicitly write the conditional entropy

$$H(Y \mid Z_2) = \int_{-\infty}^{\infty} P_{Z_2}(z)\, H(Y \mid Z_2 = z)\, dz = \int_{-\infty}^{\infty} P_{Z_2}(z) \left( p_1 \log \frac{1}{p_1} + p_0 \log \frac{1}{p_0} \right) dz \tag{A.10}$$

with $p_1 = \Pr[Y = 1 \mid Z_2 = z]$ and $p_0 = 1 - p_1 = \Pr[Y = 0 \mid Z_2 = z]$, which can be written as

$$p_1 = \Pr[Z_1 + Z_2 + W \ge 0 \mid Z_2 = z] = \Pr[Z_1 + W \ge -z] = \Pr[\mathcal{N}(0, S^2) \ge -z] = Q\left( -\frac{z}{S} \right),$$

$$p_0 = \Pr[Z_1 + Z_2 + W < 0 \mid Z_2 = z] = \Pr[Z_1 + W < -z] = \Pr[\mathcal{N}(0, S^2) < -z] = Q\left( \frac{z}{S} \right),$$

where $S^2 = i + 1$ and the $Q$ function is defined as $Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{r^2}{2}}\, dr$.

To lower bound H{Y\Z), we make use of the following inequalities for a; > [51 [TT]: 

1 2 1 ^2 



12 



a; 



l + ^<log(2eT)<log-l-<logl2 , 

2 y(2;) In 2 

Then we write the following chain of inequalities: 

H{Y\Z2)^2 f Pz,{z) fpilog-+polog-) dz 
Jo \ Pi Po/ 

>2 f Pz,{z)-po\og-Az 
Jo Po 

/•ex: 

>2/ 

1 



1 _ .2 1 _^2 

I , • e 2D^ • — • e s^ 

V2^^Z?2 12 



> + i)^= 



12V2^-D 7- 
1 /y2^ 



e-4V 



1+2^'^^ 



/V2 



12^2^7? I VA A3/2^2 ; i2Vli^ 24A3/2D52 



(A.11)
(A.12)
(A.13)
(A.14)
(A.15)
(A.16)
(A.17)

Equality (A.13) follows from the evenness of the function inside the integral and we write (A.14) by noting that $p_1 \log \frac{1}{p_1}$ and $P_{Z_2}(z)$ are non-negative. $P_{Z_2}(z)$ is expanded and the above bounds for the $Q$ function are used to obtain (A.15), and (A.16) is a regrouping of terms by defining $A = \frac{1}{D^2} + \frac{2}{S^2}$ and rewriting the limits of the integral by noting that the integrand is an even function. We obtain (A.17) by evaluating the integral. For $A$, we have

$$A = \frac{1}{K - i} + \frac{2}{i + 1} = \frac{2K - i + 1}{(K - i)(i + 1)}$$

and replacing $A$, $D$ and $S$, we can then write

V« + WK - i 



HiY\Xs2)^H{Y\Z2)>ci 



> c- 



ii + l){K-i) 



V2K~ 



iVk' 



C2 



1 



+ - + (l-a)\/a + 



{2K -i + l)3/2Vir^(i + 1) 
' 1 



K 



n a 



(A.18) 



for constants $c, c_1, c_2 > 0$.

We now analyze the second term $H(Y \mid X_S)$ to obtain an upper bound. Again, note that $H(Y \mid X_S) = H(Y \mid Z)$, then

$$H(Y \mid Z) = \int_{-\infty}^{\infty} P_Z(z)\, H(Y \mid Z = z)\, dz = \int_{-\infty}^{\infty} P_Z(z) \left( p_1 \log \frac{1}{p_1} + p_0 \log \frac{1}{p_0} \right) dz$$

where this time we define $p_1 = \Pr[Y = 1 \mid Z = z]$ and $p_0 = \Pr[Y = 0 \mid Z = z]$, which can be written as

$$p_1 = \Pr[Z + W \ge 0 \mid Z = z] = \Pr[W \ge -z] = \Pr[\mathcal{N}(0, 1) \ge -z] = Q(-z),$$

$$p_0 = \Pr[Z + W < 0 \mid Z = z] = \Pr[W < -z] = \Pr[\mathcal{N}(0, 1) < -z] = Q(z).$$

Then, write the following chain of inequalities:

H{Y\Z)^2 f Pz{z)(pilog- + {l-pi)log-^ 
Jo \ Pi 1 " Pi 



dz 



< 



4 y Pz{z) (p, log ^\ dz 



<4 



1 _^1 

p 2K — g 

V2^ 2 



log 12 



-BV 



log 12 



2 In 2 
dz 



dz 



y/2^J^oo" y""^" ' 21n2 

log 12 %/27r 1 V2tt _ log 12 

y/2TTK ^/B ^/2^2 In 2 ^3/2 ~ ./bK ' 2 In 2^/KB^/^ 



+ 



(A.19) 
(A.20) 
(A.21) 
(A.22) 
(A.23) 



Equality (A.19) follows from the evenness of the function inside the integral and we write (A.20) by noting that $p \log \frac{1}{p} \ge (1 - p) \log \frac{1}{1 - p}$ for $0 \le p \le \frac{1}{2}$. $P_Z(z)$ is expanded and the above bounds for the $Q$ function are used to obtain (A.21), and (A.22) is a regrouping of terms by defining $B = \frac{1}{K} + 1$ and rewriting the limits of the integral by noting that the integrand is an even function. We obtain (A.23) by evaluating the integral. Replacing $B$, we then have



H{Y\Xs) = H{Y\Z) < ci 
for constants ci, C2 > 0. 



1 



C2 



VK + l \j, + l)VK + l 



O 



1 



^/WTl 



(A.24)

Looking at (A.18) and (A.24), we have the following:

IiXsi;Xs2,Y) = H{Y\Xs2) - HiY\Xs) = n^y^). (A.25) 

Finally, since $\log \binom{N-K}{i}\binom{K}{i} = \Theta(i \log N)$, we can write

log(^T^)(^) /aK\oeN\ 

I{Xsi;Xs2,Y) \ ^a J 



which is satisfied by $T = \Omega(K \log N)$, proving Theorem 4.4.